Toward a Standardized Format for ASCII Text Documents



A Working Paper of

The ICADD Subcommittee on Standardization of ASCII Text

Documents



Prepared at the Trace Research and Development Center

Gregg C. Vanderheiden, Ph.D.

Neal Ewers





Keywords: Document access, ASCII, text documents, standard,

print disabilities, alternate formats, braille





<B>DRAFT</B>





Table of Contents

<LIST>

1. The Need for A Standard Electronic Format for Electronic

ASCII Text Files                                           1



1.a.  The Need                                             1



1.b.  Current ASCII Format                                 1



1.c.  Requirements of the New ASCII Format                 1





2.  Overall Goal                                           3





3.  Formal versus Informal Documents                       3



3.a.  Type 1 -- Informal Documents (ICADD-0 Format)        3



3.b.  Type 2 -- Informal Documents (ICADD-8 Format)        3



3.c.  Type 3 -- Formal Documents (ICADD-22)                3





4.  Specific Goals                                         4





5. Constraints                                             4





6. Proposed Format for Type 1 Documents                    5





7.  Proposed Format for Type 2 Documents

(ICADD-8) Format                                           7



7.a. Tag Rationale                                         9





8.  Request for Input                                      9



</LIST>








1. The Need for A Standard Electronic Format for Electronic

ASCII Text Files





1.a.  The Need



Individuals who are blind or who have other print

disabilities have difficulty in accessing and effectively

using documents in print form.  One approach to addressing

this is to provide the documents in electronic form.

Individuals using microcomputers and other electronic

reading aids can then access and have the information

presented to them in speech, braille, large text, or other

suitable form.  Because of the large number of different

formats in which electronic text can be stored, specifying

that a document must be in "electronic form" will not

necessarily result in an electronic document which can in

fact be accessed or read.  Some standard format which can be

read by all software is therefore necessary.



Unless a standard definition of an "ASCII text document" is

created, it will not be possible to create tools which can

easily work with these documents.  Further, it is difficult

to specify that people must provide their information as an

ASCII text file if no definition as to exactly what that

means is provided.





1.b.  Current ASCII Format



Currently, the most common format available is what might be

called an ASCII text file.  This is a file which contains

only standard ASCII text characters (Table 1).  To

accommodate foreign languages, this standard has been

revised by the International Organization for Standardization (ISO) as

shown in Table 2.



In either case, ASCII or ISO, the text file does not include

any formatting information.  Thus, any information that was

encoded in an original document by using boldface,

underlining, italics, footnote designations, etc., is lost

in a document that is changed into ASCII text form.  Since

the boldface, underlining, etc., may convey

important information, converting a document into a straight

ASCII file may in fact cause some important information to

be lost and therefore unavailable to the individual using

the ASCII text file.





1.c.  Requirements of the New ASCII Format



One requirement of a standard ASCII text file format therefore

would be that it provide some mechanism for preserving

essential formatting information that might otherwise be

lost.



<pp>1</pp>



A second requirement is that the standard must clearly

define how the ASCII text file would be formatted.  For

example, is there a carriage return at the end of each line,

or only at the end of paragraphs?  (Documents with carriage

returns only at the end of paragraphs cause a problem for

some screen reading programs.)  If there is a carriage

return at the end of each line, how does one identify the

end of a paragraph, so that screen readers can read smoothly

across lines, but stop at the end of a paragraph?





Table 1: ASCII Characters



The ASCII value is listed to the left, and its corresponding

character to the right.



<LIST>

33   !

34   "

35   #

36   $

37   %

38   &

39   '

40   (

41   )

42   *

43   +

44   ,

45   -

46   .

47   /

48   0

49   1

50   2

51   3

52   4

53   5

54   6

55   7

56   8

57   9

58   :

59   ;

60   <

61   =

62   >

63   ?

64   @

65   A

66   B

67   C

68   D

69   E

70   F

71   G

72   H

73   I

74   J

75   K

76   L

77   M

78   N

79   O

80   P

81   Q

82   R

83   S

84   T

85   U

86   V

87   W

88   X

89   Y

90   Z

91   [

92   \

93   ]

94   ^

95   _

96   `

97   a

98   b

99   c

100  d

101  e

102  f

103  g

104  h

105  i

106  j

107  k

108  l

109  m

110  n

111  o

112  p

113  q

114  r

115  s

116  t

117  u

118  v

119  w

120  x

121  y

122  z

123  {

124  |

125  }

126  ~

127  

</LIST>





Table 2: ISO Characters



<other>Table 2 will go here</other>



<pp>2</pp>



2.  Overall Goal



The purpose of the ICADD ASCII Text Format Standard is to

provide a standard format for ASCII text documents.  This

effort to define a standard ASCII text format is a subset of

the overall goals of the International Committee for

Accessible Document Design (ICADD).  This group, which was

formed in 1992, has an overall scope of work which includes

both the development of a format for simple ASCII text

documents and the development of a standard for more formal

publications.  The standard for more formal publications is

not covered in this subcommittee report.





3.  Formal versus Informal Documents



Currently, the ICADD efforts cover three types of documents:

two informal and one formal.





3.a.  Type 1 -- Informal Documents <B>(ICADD-0 Format)</B>



With the proliferation of computers, there has been a

corresponding increase in the number of letters, memos, and

other informal written communications which are prepared

using word processors rather than typewriters.  This makes

it possible for a large amount of this material to be sent

to people as an ASCII text file when this is their

preference.  Type 1 documents include all of those informal

documents where there is no formatting (boldface, italics,

footnotes, etc.) which is necessary to understand the

documents (or where the loss of boldface, italics, etc.,

would not alter the reader's ability to understand the

document).  For this type of information, a very simple

ASCII Text Standard has been defined, and is described

below.  It includes no formatting information, and does not

support the use of boldface, underlining, etc., in a

document.





3.b.  Type 2 -- Informal Documents <B>(ICADD-8 Format)</B>



In addition to informal correspondence and documents, there

are also a number of other informal or semi-formal documents

and reports which are prepared using standard word

processors.  In these documents, however, formatting (such

as boldface, italic, underline, etc.) is often used to

convey important information in the document.  In addition,

these documents often contain footnotes, side-bars, or boxed

text which is interspersed with the running text of the

document.  Converting these documents into simple ASCII text

files (without preserving the formatting information) can

cause both confusion and loss of information.  Where text

formatting conveyed information, the information would be

lost.  When footnotes, boxed text, or side-bars suddenly

appear intermixed with running text (without any type of

marker), the resulting text file can be very confusing and

even misleading.  For these types of documents, a set of

eight tags is defined which allow users to mark common

attributes.  Specifically, these tags allow the user to mark

boldface, italicized, or other emphasized text, as well as

to mark list items, picture captions, side-bars or boxed

text, and page numbers.  This Type 2 document format is

referred to as <B>ICADD-8</B> and is described below.





3.c.  Type 3 -- Formal Documents (ICADD-22)



The third type of document defined by the ICADD effort is

formal documents, including books, journals, and other

formal publications.  Such documents can often contain

multiple sections or chapters as well as specially formatted

text.  In addition, these documents may also include

equations, tables, columns, and other specially formatted

information.  A set of 22 tags has been defined by ICADD to

allow these documents to be more effectively accessed and

read.  In addition, further specialized tag sets are being

explored to handle scientific, mathematical, and



<pp>3</pp>



other types of specially formatted text.  The purpose of

these tags is to allow special commercial document readers

to translate documents which are in the standard ICADD

format into documents that are structured for use in the

document reader.  The result is a document which can be

accessed and used by a person with a print disability in a

manner which is both complete (contains all of the

information in the original) and efficient (allows rapid

movement about and within the text).  Specifications for

Type 3 documents are provided in a separate document.





4.  Specific Goals



This document outlines the current draft of the ICADD

specifications for Type 1 and Type 2 informal documents.  It

is a first draft, and is being released so that persons with

print disabilities and others interested in this problem can

review it and offer input concerning the proposed

specifications.  This document was prepared based upon

questionnaires answered by and conversations with members of

the ICADD subcommittee charged with arriving at the design

of ASCII formats.  Members of this committee include:



<LIST>



Jim Allan

Texas School for the Blind

1100 W. 45th Street

Austin, TX  78756

512/454-8631

Internet: jallan@tenet.edu



Charles Crawford, Commissioner

Executive Office of Human Services

Commission for the Blind

Boston, MA  02111-2227

617/727-5550



Judith Dixon

Consumer Relations Office

National Library Service for the Blind and Physically

Handicapped

Library of Congress

Washington, DC  20542

202/707-5100

Internet: 74036.2101@Compuserve.com



Neal Ewers

Trace Research and Development Center

Room S-153, Waisman Center

1500 Highland Avenue

Madison, WI  53705

608/263-5485

fax 608/262-8848



John Hernandez

New York Institute for Special Education

9999 Pelham Parkway

Bronx, NY  10469

718/519-7000, extension 348

fax 718/231-9314



David Holladay

Raised Dot Computing

408 S. Baldwin Street

Madison, WI  53703

608/257-9595



Gregg Vanderheiden

Trace Research and Development Center

Room S-151 Waisman Center

1500 Highland Avenue

Madison, WI  53705

608/262-6966

Internet: vanderhe@macc.wisc.edu

fax 608/262-8848



</LIST>





5. Constraints



This section contains a listing of the constraints which a

standard format in this area must meet.



<pp>4</pp>



a)     Any proposed guidelines must work easily on a wide

variety of computer platforms.



b)     The guidelines must be easy to implement, even on the

most rudimentary word processor.



c)     The guidelines should use terminology and strategies

which can be understood by any person responsible for

preparing documents in this format (secretaries, students,

etc.).



d)     Each level of format should be internally consistent

with the higher level formats (e.g., Type 2 must be

consistent with Type 3).





6. Proposed Format for Type 1 Documents



This section presents the proposed format for Type 1

documents <B>(ICADD-0 Format)</B>.  A summary of the format

rules is presented, followed by a rationale for each of the

rules.



1)     Text should be broken up into lines with hard

carriage returns at the end of each line.



2)     Each line should be no longer than 78 characters.

(65 characters is preferable for documents which are short

and where short lines do not cause layout problems.)



3)     There should be two carriage returns at the end of

each paragraph.



4)     All titles within the document text should be

preceded by an extra carriage return (for a total of three

carriage returns) if they are not at the top of a page or

the document.



5)     All carriage returns should be followed by a line

feed character.



6)     Text in an ICADD-ASCII formatted document is limited

to printable ASCII characters with codes between 33 and 127,

plus Space (32), Tab (09), Carriage Return / Line Feed (13,

10) and Form Feed or New Page (12).  The basic characters

for 33 to 127 include (in order):

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = >

? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \

] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z

{ | } ~ 



<pp>5</pp>



1)     Text should be broken up into lines with hard

carriage returns at the end of each line.



<B>Rationale</B>: Some text readers are not able to scroll

past the end of a screen line.  Thus, hard carriage returns

at the end of each line are necessary in order to keep these

programs from crashing.



<B>Comment to Reviewers</B>:  The number of programs which

cannot handle text without carriage returns at the end of

each line is decreasing.  Some people felt that we might

lean into the future on this, and not specify carriage

returns at the end of every line.  This simplifies some

other document interpretation.  Most of the people we talked

to, however, felt that many individuals trying to access

these ASCII text files are not yet using the more

sophisticated tools, and that at least for the foreseeable

future it was better to stick with the hard carriage return

on each line format.  This is therefore included in the

current version of the format.  Additional comments, pro and

con, are invited.



2)     Each line should be no longer than 78 characters.



<B>Rationale</B>:  Using an 80-character line can cause some

computer displays to automatically word-wrap after the 80th

character.  If this is then followed by a carriage return,

it would result in all of the lines being double-spaced.  A

78-character limit eliminates this problem.  All modern

computers support an 80-character display.  Thus, adhering

to this format would result in documents which display

without distortion on any standard screen.  For printouts,

this would also fit in 6.5" at 10-point Courier, and thus

would print out on standard 8 1/2" x 11" paper with 1"

margins.  For documents which are short, and where short

lines will not create layout problems, a 65-character line

is more convenient for some users.
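As an illustration only, rules 1 and 2 can be applied mechanically.
The sketch below re-flows one paragraph so that no line exceeds 78
characters and every line ends in a carriage return and line feed.
The function name wrap_paragraph and the constant MAX_WIDTH are
hypothetical and are not part of the proposed format.

```python
import textwrap

MAX_WIDTH = 78  # limit from rule 2; 65 is the suggested alternative


def wrap_paragraph(text, width=MAX_WIDTH):
    """Re-flow one paragraph into lines of at most `width` characters.

    Each output line ends with a carriage return and line feed, as in
    rules 1 and 5.  Runs of whitespace in the input are collapsed first.
    """
    lines = textwrap.wrap(" ".join(text.split()), width=width)
    return "".join(line + "\r\n" for line in lines)
```

Passing width=65 gives the shorter line length suggested for short
documents.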

3)     There should be two carriage returns, with no spaces

(or other characters) between them, at the end of each

paragraph.



<B>Rationale</B>: More than one carriage return is needed in

order to differentiate the carriage return at the end of a

paragraph from the carriage return at the end of each line.

It is important that there be no characters between the two

carriage returns in order to facilitate machine

identification of the dual carriage return.
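A minimal sketch of the machine identification mentioned in this
rationale follows.  It assumes each carriage return is stored with its
line feed (rule 5), so a paragraph break is the four-character
sequence CR LF CR LF.  The name split_paragraphs is hypothetical.

```python
PARAGRAPH_BREAK = "\r\n\r\n"  # two carriage returns, each with its line feed


def split_paragraphs(document):
    """Split a document into paragraphs at each dual carriage return.

    Lines inside a paragraph are joined with single spaces so that a
    screen reader can read smoothly across the line ends.
    """
    paragraphs = []
    for block in document.split(PARAGRAPH_BREAK):
        pieces = [line.strip() for line in block.split("\r\n")]
        text = " ".join(piece for piece in pieces if piece)
        if text:
            paragraphs.append(text)
    return paragraphs
```

If a space is left between the two carriage returns, the sequence no
longer matches and the two paragraphs are read as one, which is the
reason the rule forbids any characters between them.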



<pp>6</pp>



4)     All titles within the document text should be

preceded by an extra carriage return (for a total of three

carriage returns).



<B>Rationale</B>: Providing the third carriage return after

paragraphs which precede titles makes it easy to identify

titles automatically in a document.
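The automatic identification of titles might be sketched as follows.
The sketch assumes carriage return and line feed pairs (rule 5) and
assumes that a title is itself ended like a paragraph, with two
carriage returns.  The name find_titles is hypothetical.  A title at
the very top of a document or page has no extra carriage return in
front of it and is not found by this approach.

```python
TITLE_BREAK = "\r\n" * 3      # three carriage returns precede a title
PARAGRAPH_BREAK = "\r\n" * 2  # two carriage returns end a paragraph


def find_titles(document):
    """Return the text that follows each run of three carriage returns."""
    titles = []
    # Every block after the first one starts with a title.
    for block in document.split(TITLE_BREAK)[1:]:
        first_paragraph = block.split(PARAGRAPH_BREAK, 1)[0]
        title = " ".join(first_paragraph.split())
        if title:
            titles.append(title)
    return titles
```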



5)     All carriage returns should be followed by a line

feed character.



<B>Rationale</B>: MS-DOS and other environments provide a

line feed following each carriage return in the document.

Documents in the Apple Macintosh environment, however, do

not provide any line feed following the carriage return.  A

document with line feeds in either environment is quite

readable, although in the Apple Macintosh environment each

line is preceded by a square bracket on the screen.  If the

line feeds are left out in MS-DOS documents, however, some

software will have difficulty with the document.  The

recommendation is therefore to provide a line feed with

every carriage return.  For any environments in which the

line feed is superfluous, it can be very easily removed

using a search-and-replace command.  It is expected that

translation programs will also be developed that will remove

all ICADD format tags from a document and change them

directly into format commands for popular word processors

(WordPerfect, Microsoft Word, MacWrite, etc.).  When this is

done, the linefeeds can also be removed if appropriate.
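The search-and-replace steps described here can be sketched in a few
lines.  The function names to_crlf and strip_line_feeds are
hypothetical; the first converts any mixture of line endings to the
carriage return plus line feed form recommended above, and the second
removes the line feeds again for an environment that does not want
them.

```python
import re

# Matches CR LF first, so an existing pair is never doubled.
ANY_LINE_END = re.compile(r"\r\n|\r|\n")


def to_crlf(text):
    """Give every line ending the form carriage return + line feed."""
    return ANY_LINE_END.sub("\r\n", text)


def strip_line_feeds(text):
    """Remove the line feed that follows each carriage return."""
    return text.replace("\r\n", "\r")
```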



6)     Text in an ICADD-ASCII formatted document is limited

to the ASCII characters with codes between 33 and 127, plus

SPACE, TAB, CARRIAGE RETURN (and LINE FEED), and FORM FEED

(new page).



<B>Rationale</B>: Characters above ASCII 127 are not

standardized.  They are also not supported by many programs

and readers.
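A checker for rule 6 could look like the sketch below.  It follows the
rule as drafted: codes 32 through 127, plus Tab (9), Line Feed (10),
Form Feed (12), and Carriage Return (13).  Note that code 127 is the
DEL control character in standard ASCII; it is accepted here only
because the draft names the range 33 to 127.  The name
disallowed_characters is hypothetical.

```python
ALLOWED_CONTROL_CODES = {9, 10, 12, 13}  # Tab, Line Feed, Form Feed, Carriage Return


def disallowed_characters(text):
    """Return (position, code) for each character outside the rule 6 set."""
    problems = []
    for position, character in enumerate(text):
        code = ord(character)
        in_basic_range = 32 <= code <= 127  # Space plus codes 33 to 127
        if not (in_basic_range or code in ALLOWED_CONTROL_CODES):
            problems.append((position, code))
    return problems
```

An empty result means the text uses only the permitted characters.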

7.  Proposed Format for Type 2 Documents <B>(ICADD-8 Format)</B>



The ICADD-8 format includes the six guidelines listed above,

plus eight additional tags that cover bold, italic, and

other emphasized text, as well as lists, footnotes, figure

descriptions, side-bars, and page numbers.  These tags are:



1.     BOLD:  <b>text to be bolded</b>





2.     ITALICS:  <it>text to be in italics</it>



3.     OTHER:  <other>Other emphasized text</other>

"Other" includes all emphasized text that is not bold,

italic, or bold & italic; for example, underlined text.



<pp>7</pp>



4.     LIST ITEM:

<litem>item in list</litem>

<litem>item in list</litem>

<litem>item in list</litem>



The principal reason for tagging items in a list is to

differentiate a list of single-spaced items (with a carriage

return at the end of each line) from a paragraph of running

text (which would also have a carriage return at the end of

each line).  Without some way of easily distinguishing a

list, screen reading and other automatic processing software

may strip out the carriage returns and change a list into a

stream of running text.  This would be devastating to most

lists, and particularly to lists such as a Table of Contents.

Two options for handling lists are supported.  The first

option, shown above, is to place standard SGML list item tags

before and after each item in a list.

Option 2:

<list>Item 1

Item 2

Item 3

Item 4</list>



The second option is a special ICADD-8 tag to be placed at

the beginning and end of a list.  With this option, instead

of putting a tag before and after each item in the list, a

tag is placed before and after the entire list.  This option

is provided to make it easier to read lists if a person is

not using a program that removes the tags.  It also makes it

easier for hand-tagged text to be created.  This second

option is particularly handy when dealing with Tables of

Contents and other similar lists, where each item in the

list occupies its own line, and the list items can occupy an

entire line.  (Adding tags before and after each item would

cause all of the lines to wrap and break up or be longer

than 78 characters.)



<b>Reviewers Note!</b>  Note that this is <b>not</b> a

standard SGML tag.  It also violates one of the constraints

stated above, which says that all of the ICADD

specifications should be subsets of each other.  It does

appear, however, to be a very useful option.  Comments pro

and con are invited.
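To show how processing software might keep list items apart under
either option, here is a minimal sketch.  The function names
litem_items and list_blocks are hypothetical.  Tag matching ignores
case because this paper itself uses both upper-case and lower-case
tags; that choice is an assumption, not a stated rule.

```python
import re

LITEM = re.compile(r"<litem>(.*?)</litem>", re.IGNORECASE | re.DOTALL)
LIST_BLOCK = re.compile(r"<list>(.*?)</list>", re.IGNORECASE | re.DOTALL)


def litem_items(document):
    """Option 1: return the text of every individually tagged item."""
    return [" ".join(item.split()) for item in LITEM.findall(document)]


def list_blocks(document):
    """Option 2: return one list of items per <list> block.

    Each non-empty line inside the block is treated as one item, so the
    line breaks are kept instead of being merged into running text.
    """
    blocks = []
    for body in LIST_BLOCK.findall(document):
        items = [line.strip() for line in body.splitlines()]
        blocks.append([item for item in items if item])
    return blocks
```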



5.     FOOTNOTE:  <fn>footnoted text</fn>



Footnoted text should be placed in the text and not at the

bottom of the page so that it is close to the item it refers

to.



6.     FIGURE DESCRIPTION:  <figure>Text in a figure

description</figure>

<figure>Figure caption</figure>



This tag is used both for figure captions and for

descriptions of figures.  Descriptions should be provided

for all figures, pictures, or other illustrations which are

not completely redundant with the text of the document.



<pp>8</pp>



7.     BOXES AND BLOCKED TEXT:  <box>Text in a box, side-

bars, etc.</box>



Tag all boxed text (e.g., Sidebars, Historical Notes and

other miscellaneous inserted text), and place it within

the running text of the document at a location similar to

its location in the printed document.



8.     PRINT PAGE REFERENCE:  <pp>print page reference</pp>



When a document is converted to ASCII text, it almost always

ends up on a different page number than the original, or it

appears as a continuous text file with no page delimiters.

In both cases, it is not possible to make any sense out of

page references in the original text document (e.g., "See

page 5") or the index or a Table of Contents.  It is also

difficult to discuss the document with people using a print

copy.  Preserving the page boundaries of the original

printed document is therefore often important.
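As a sketch of how a reader program might use the page tags, the
function below gathers the text belonging to each print page.  It
assumes that a <pp> tag is placed at the bottom of the page it
numbers, which is how the tags appear to be used in this paper; the
draft does not state this explicitly.  The name text_by_print_page is
hypothetical.

```python
import re

PAGE_TAG = re.compile(r"<pp>(.*?)</pp>", re.IGNORECASE | re.DOTALL)


def text_by_print_page(document):
    """Map each print page reference to the text that precedes its tag."""
    pages = {}
    start = 0
    for match in PAGE_TAG.finditer(document):
        label = match.group(1).strip()
        pages[label] = document[start:match.start()].strip()
        start = match.end()
    return pages
```

Text after the last tag is not assigned to any page in this sketch.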





7.a. Tag Rationale



These tags (with the exception of the second list option)

are all taken directly from the standard SGML tags that are

used in the formal Type 3 documents.  The purpose of this

minimal set of eight tags is to allow tagging of very common

formatting information in the informal documents, in order

either to preserve formatting information important to

understanding the text or to make it easier for automatic

text readers to deal with these documents.
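For readers whose software does not interpret the tags, a simple
filter can remove them.  The sketch below lists the tag names used in
this paper, including the second list option; the names ICADD8_TAGS
and strip_tags are hypothetical.  Only these tags are removed, so
other uses of the < and > characters in the text are left alone.

```python
import re

# Tag names as they appear in this paper, plus the second list option.
ICADD8_TAGS = ("b", "it", "other", "litem", "list", "fn", "figure", "box", "pp")

TAG = re.compile(r"</?(?:%s)>" % "|".join(ICADD8_TAGS), re.IGNORECASE)


def strip_tags(text):
    """Remove opening and closing ICADD-8 tags, keeping the tagged text."""
    return TAG.sub("", text)
```

This filter discards the information the tags carry; a fuller
translation program would turn each tag into the matching command of
the target word processor instead.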





8.  Request for Input



This is a working document, and input of all types is

solicited.  Because of the pressure to put out a first

release of this standard, however, please send comments

sooner rather than later.  Also, in order to get the widest

possible review and input to the document, please copy and

redisseminate it to any people or forums you think would be

interested.



You can send comments directly to the subcommittee chair via

e-mail, regular mail, or fax:



<LIST>



ICADD ASCII Subcommittee

c/o Gregg Vanderheiden (chair)

S-151 Waisman Center

1500 Highland Avenue

Madison, WI  53705

608/262-6966 voice

608/263-5408 TT/TDD

608/262-8848 fax

vanderhe@macc.wisc.edu



</LIST>




