WebLadder Project

                                  Online WebLadder Project
                                  (See FAQ for details.)


                                  TextLadder


                                  TextLadder 3.0

                                    TextLadder 3.0 is a new version of the traditional TextLadder program.
                                  Although details about the earlier TextLadder can be learned below,
                                  here's a brief description of it. TextLadder 1.0/2.0 had the ability to
                                  sort through a collection of texts in a given domain (e.g. economics,
                                  physics, medicine, etc.) and sort these texts into a special order, such
                                  that students reading the texts in this order would learn the new
                                  vocabulary incrementally, at a pace they could handle. However, TextLadder
                                  was only able to do this for domains for which there existed domain-
                                  specific word lists. This meant that it was up to the TextLadder user
                                  to go out and find or make these domain-specific lists, a process both
                                  time-consuming and difficult.

                                    TextLadder 3.0 completely avoids this problem: in the new
                                  TextLadder no lists are necessary. This is because the program is
                                  now capable of compiling its own word lists. All that the user is
                                  responsible for is providing the collection of texts, all of which must
                                  be from the same domain.

                                    There are a number of other relatively new options that appeared only
                                  in the later versions of TextLadder 2.0. One is the ability to organize
                                  the texts into subgroups, so that texts within subgroups are not separated
                                  from each other when the reading order of the texts is constructed. Another
                                  option is to have the user decide in advance exactly what the reading order
                                  should be, a choice which eliminates TextLadder's ability to control the
                                  incremental introduction of new vocabulary, but which still gives the reader
                                  some of the "fringe benefits" of TextLadder -- such as the list of new
                                  vocabulary presented prior to reading each text. One last new feature which
                                  users will find interesting is the ability to link all the words in each
                                  text and each pre-reading list to an online dictionary, so that a reader
                                  can click on any word they don't know and get an immediate dictionary
                                  definition or translation.

                                    Below you'll find the help documentation for the previous version of
                                  TextLadder, version 2.0. Relevant to the current version, 3.0, are
                                  "Part-of-speech tagging support", under section 2, and "Length and
                                  number of texts processed by TextLadder" through "Recommended number
                                  of texts", under section 5. If you have any questions at all about
                                  TextLadder 3.0, please don't hesitate to either write me or leave me
                                  a message on the message board.




                                  TextLadder 2.0

                                  1. General Remarks
                                  2. Changes from version 1.0
                                             Part-of-speech tagging support
                                             SOWFI support
                                             Different algorithm
                                             More user-changeable defaults
                                  3. Old features from 1.0
                                             Rule-based parsing engine
                                             Proper noun detection
                                             Production of output
                                  4. Word Lists
                                             Description
                                             The SOWFI classification system
                                             Download
                                  5. Important Notes
                                             Processing time
                                             Windows XP and Windows 2000 vs. Windows 95, 98 and Me
                                             Windows XP display bug
                                             Windows XP part-of-speech tagging
                                             The content of ScreeningOutput.txt
                                             Length and number of texts processed by TextLadder
                                             Mid-sentence line breaks
                                             Preparing texts for processing
                                             Recommended number of texts (for pedagogical purposes)


                                  General remarks
                                     This documentation assumes you have read the article online at
                                  LLT. If you have not, you should at the very least read this extract
                                  from the article and the definitions that follow it before continuing to
                                  read on:

                                  "This article addresses the problem of how to bring foreign language
                                  students with a limited vocabulary consisting mainly of high-frequency
                                  words, to the point where they are able to adequately comprehend
                                  authentic texts in a target domain or genre. It proposes bridging the
                                  vocabulary gap by determining which word families account for 95%
                                  of the target domain's running words, and then having students learn
                                  these word families by reading texts in an order that allows for the
                                  incremental introduction of target vocabulary. This is made possible by
                                  a recently developed computer program that sorts through a collection
                                  of texts and 1) finds texts with a suitably high proportion of target
                                  words, 2) ensures that over the course of these texts, most or all target
                                  words are encountered 5 or more times, and 3) creates an order for
                                  reading these texts, such that each new text contains a reasonably
                                  small number of new target words and a maximum number of familiar
                                  words."

                                  In the following documentation,

                                  Sequence list refers to the list of texts arranged into the order
                                  mentioned above

                                  Pre-reading list refers to the list of new words that a student needs
                                  to familiarize him/herself with via a dictionary (preferably CD-ROM or
                                  Internet-based) prior to reading each text

                                  Included low-frequency words refers to words included on the
                                  pre-reading list that occur fewer than 5 times throughout the texts.
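
                                  The 95% coverage figure in the extract above can be illustrated with
                                  a minimal sketch. It assumes a much simpler model than TextLadder's:
                                  it matches exact word forms rather than word families and ignores
                                  proper nouns. The function name and the sample data are hypothetical.

```python
import re

def token_coverage(text, known_words):
    """Return the share of running words (tokens) in `text` found in `known_words`."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)

# Hypothetical list of familiar words and a two-sentence sample text.
known = {"the", "rate", "of", "tax", "is", "high"}
sample = "The rate of tax is high. The deficit is high."

coverage = token_coverage(sample, known)   # 9 of the 10 tokens are familiar
passes = coverage >= 0.95                  # compare with the default threshold
print(f"coverage = {coverage:.0%}, passes screening: {passes}")
```

                                  Here one unfamiliar word in ten tokens already pushes the text below
                                  the 95% threshold.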


                                  Changes from version 1.0
                                  TextLadder version 1.0 was created in 2000 and is the version
                                  described in the article in Language Learning & Technology. Anyone
                                  who has read the article should be aware that TextLadder 2.0
                                  differs from TextLadder 1.0 in the following ways:

                                  Part-of-speech tagging support
                                  TextLadder 2.0 supports part-of-speech tagging. This means that
                                  if the part-of-speech option in TextLadder is selected, words in the
                                  outputted pre-reading list are accompanied by part-of-speech
                                  information ("n.", "adj.", etc.). (See the Sample Output section for an
                                  example.) At this time, TextLadder only supports the Brown tagset,
                                  which is the same tagset used by Brill. Users have two options: they
                                  can install Brill on their computer and have TextLadder run Brill for them
                                  (TextLadder also formats the texts for Brill prior to tagging and
                                  afterwards renames the tagged files and puts them in their own folder),
                                  or they can manually run the tagger and rename the files themselves. (If
                                  the second option is chosen, users should contact me about exactly
                                  how to go about doing this.)

                                  SOWFI support
                                  TextLadder 2.0 supports SOWFI classification. SOWFI stands for
                                  "semantically opaque word family item". It refers to individual items
                                  within word families whose meanings are not guessable on the basis of
                                  the word family's base meaning, even though some kind of semantic
                                  relationship connects them to it. Examples of SOWFI's are "shortly"
                                  within the word family "short", "governor" within the word family
                                  "govern", "earthy" within the word family "earth", and "namely" within the
                                  word family "name". It is important that TextLadder distinguish between
                                  SOWFI's and other members of the word family, because one cannot
                                  assume that a learner who has encountered another member of the
                                  word family will subsequently be able to understand a SOWFI when
                                  confronted with it. Therefore, although a SOWFI adds to the overall
                                  frequency count of its word family during the TextLadder sorting
                                  process, it is given its own appearance in the pre-reading section. (The
                                  SOWFI classification system itself is described in more detail below.)
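
                                  The counting rule just described can be sketched as follows. This is
                                  only an illustration of the idea, not TextLadder's code: the FAMILIES
                                  table, its layout, and the function name are assumptions made for
                                  the example.

```python
from collections import Counter

# Hypothetical mini word list: headword -> regular members and SOWFI members.
FAMILIES = {
    "short": {"members": {"short", "shorter", "shortest"}, "sowfis": {"shortly"}},
    "govern": {"members": {"govern", "governs", "governed"}, "sowfis": {"governor"}},
}

def analyse(tokens):
    """Count family frequencies and collect SOWFIs needing their own pre-reading entry."""
    family_counts = Counter()
    sowfi_entries = []
    for token in tokens:
        for headword, family in FAMILIES.items():
            if token in family["members"] or token in family["sowfis"]:
                family_counts[headword] += 1          # a SOWFI still counts toward its family
                if token in family["sowfis"] and token not in sowfi_entries:
                    sowfi_entries.append(token)       # but it is also listed on its own
    return family_counts, sowfi_entries

counts, sowfis = analyse(["short", "shortly", "governed", "shorter", "governor"])
print(dict(counts))   # family totals, SOWFIs included
print(sowfis)         # items that get a separate pre-reading entry
```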

                                  Different algorithm
                                  TextLadder 2.0 uses a different algorithm from 1.0 when arranging
                                  texts into a sequence list. This algorithm involves first trimming the
                                  word lists to fit the particular corpus of texts being processed (note
                                  that a separate trimmed list is created: the old lists remain intact).
                                  The trimmed list still provides coverage that is above the reading
                                  comprehension threshold (set at 95% by default), but contains fewer
                                  low-frequency words. This means that sequence lists are shorter AND
                                  that there are fewer low-frequency words included in the pre-reading list
                                  than in version 1.0. Note that the actual number of low-frequency words
                                  in the pre-reading lists can be decreased even further by changing one
                                  of TextLadder's default values, but there will be a corresponding
                                  increase in the number of texts on the sequence list. In other words,
                                  there is a trade-off between the number of low-frequency words in the
                                  pre-readings and the length of the sequence list.
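
                                  The idea of trimming a list while staying above the threshold can be
                                  sketched as below. The documentation does not say how TextLadder
                                  chooses which items to drop, so the greedy "most frequent families
                                  first" rule used here is purely an assumption, as are the function
                                  name and the sample counts.

```python
def trim_word_list(family_counts, total_tokens, threshold=0.95):
    """Keep the most frequent families until their combined coverage reaches `threshold`.

    Returns a new list; the `family_counts` mapping itself is left untouched.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    trimmed = []
    covered = 0
    ranked = sorted(family_counts.items(), key=lambda item: item[1], reverse=True)
    for family, count in ranked:
        if covered / total_tokens >= threshold:
            break                      # threshold reached: drop the remaining rare families
        trimmed.append(family)
        covered += count
    return trimmed, covered / total_tokens

# Hypothetical corpus of 100 running words; 99 of them belong to list families.
counts = {"tax": 40, "rate": 30, "market": 20, "fiscal": 6, "levy": 2, "tariff": 1}
trimmed, coverage = trim_word_list(counts, total_tokens=100)
print(trimmed, coverage)
```

                                  In this toy corpus the two rarest families are dropped, and the
                                  trimmed list still covers more than 95% of the running words.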

                                  User-changeable default values
                                  There has been a change in the number and kind of default values
                                  changeable by the user in TextLadder 2.0.

                                  a) Minimum number of encounters
                                        In version 1.0, you could change the minimum number of
                                  encounters required from 5 to 6; in version 2.0 you can change it to
                                  any value you want.

                                  b) Maximum number of new words per text
                                        In version 1.0, the maximum number of new words per text was set
                                  permanently at 25. In version 2.0, this value can be changed by the user
                                  to any value desired. As with version 1.0, however, there will be certain
                                  cases (especially among the first few texts) where TextLadder will be
                                  unable to respect the maximum limit.

                                  c) Theoretical reading comprehension coverage threshold
                                     In version 1.0, this threshold was set permanently at 95%. In version
                                  2.0, the threshold can be changed to anything between 1 and
                                  100%. However, users should be aware that they should not change
                                  this value simply because their lists account for more than 95% of the
                                  texts being processed: TextLadder will detect that automatically and
                                  take it into account. Users should only change this value if they believe
                                  the 95% value itself is theoretically unsound.

                                  d) Lowest number of repetitions of low frequency words
                                     This value relates to the algorithm used by TextLadder 2.0 to
                                  create the sequence list. Once all high frequency words in the corpus
                                  have been encountered among texts already on the sequence list,
                                  TextLadder switches gears and begins looking for texts with the highest
                                  number of repetitions of previously-encountered low-frequency words.
                                  The "lowest number of repetitions of low frequency words" value tells
                                  TextLadder when to stop looking for these repetitions. E.g. if the value is
                                  set at 20 (the default value), TextLadder will stop the process of adding
                                  texts to the sequence list once the number of repetitions of low-
                                  frequency words per text drops below 20.

                                  e) Build a sequence list vs. Create list of high-frequency non-list words
                                     By default this value is set to "Build a sequence list" in TextLadder
                                  2.0. However, the user has the option to select the other option,
                                  which tells TextLadder not to build a sequence list but instead look for
                                  the highest frequency NON-list words. Such a list could be useful to a
                                  user wanting to put together a "rough and ready" domain specific list.
                                  For example, if a user wanted to put together a very rough geology-
                                  related domain-specific list, s/he could collect together a group of
                                  academic texts related to geology, then open TextLadder, select this
                                  option, choose Nation's 2,000-most-frequent-words-list and Coxhead's
                                  Academic Word List as the coverage lists, and run TextLadder on the
                                  texts. The user could then see what high-frequency non-list words
                                  emerge, and use them as the basis for a very rough domain-specific
                                  word list. Note that TextLadder provides not just frequency but also
                                  range information about these non-list words, so the user could avoid
                                  including a word that occurred 15 times in only one text in her domain-
                                  specific list.

                                  f) Include non-list words and very low frequency words in pre-reading
                                  list
                                     By default this value is set to "No", meaning that only list words will
                                  appear in the pre-reading list (not necessarily all the list words in each
                                  text, but rather only as many as is necessary to reach the reading
                                  comprehension threshold, set at 95% token coverage by default). Set
                                  this value to "Yes" if you want EVERY unfamiliar word in each text listed
                                  in the pre-reading list.

                                  g) Processing Options: "Standard Processing" vs. "SubGroups Restriction
                                  Processing" vs. "Pre-Set Order Processing"

                                  * Standard processing: TextLadder arranges texts into the best
                                  possible order for vocabulary introduction, keeping the total number
                                  of texts to be read as short as possible.

                                  * Subgroups restriction: TextLadder accepts a restriction
                                  on its processing: it does not separate texts within a subgroup.
                                  Within the subgroups option, there are various sub-options
                                  the user can choose among:

                                     - The user has the option either to allow TextLadder to arrange the
                                  order of the subgroups themselves as it sees fit, or to arrange
                                  the subgroups according to a pre-set order.

                                     - The user has the option of specifying how many texts in the
                                  subgroups should appear in the final output. Thus, a user could,
                                  for example, input subgroups with 10 or more texts each, but
                                  specify that only 4 texts from each subgroup should appear on the
                                  final sequence list. This would mean that TextLadder would choose
                                  the best 4 texts from each subgroup.

                                       ** As a variant on this last option, the user can
                                  specify the number of texts from a subgroup that
                                  should appear on the final sequence list for each individual
                                  subgroup. Thus, while the number of texts specified for one
                                  subgroup might be 4, the number specified for another might be 7,
                                  and so on.

                                     - The user has the option to either have TextLadder eliminate all
                                  texts among the subgroups that do not pass the reading comprehension
                                  threshold (which is one of the defaults that may be changed, though
                                  originally set at 95% -- see above), or not to eliminate these texts.
                                  During standard processing, texts that do not pass the threshold
                                  are automatically eliminated, but a user who wants subgroup texts
                                  to be included even when they don't pass the threshold now has
                                  that option.

                                  * Pre-set order option: TextLadder will do no arranging whatsoever
                                  but will leave the texts in the exact order entered by the user.
                                  Within the pre-set order option:

                                      - As with the subgroups, the user has the option either to have
                                  TextLadder eliminate texts which do not pass the reading
                                  comprehension threshold or not.

                                  h) Option to "Create HTML versions of text files and pre-reading list"

                                  The user has the option to have TextLadder create HTML versions of
                                  the text files being processed as well as the outputted pre-reading
                                  list. The words in these HTML versions will be linkable to an online
                                  dictionary website. Students reading the texts or pre-reading lists
                                  will therefore be able to click on words they don't understand and
                                  get an instant dictionary definition or translation.

                                  NOTE: TextLadder does not supply hyperlink references for any website
                                  dictionaries. It is up to the user to research such websites and
                                  supply the reference in the box TextLadder provides. Also, it is the
                                  user's responsibility (where necessary) to contact the website and
                                  ask permission to link to them.

                                  The outputted HTML pre-reading list will appear in the TextLadder
                                  program subfolder with the rest of the output files. The HTML
                                  versions of the text files will appear in a subfolder called
                                  "TextLadderHTML", which will be located in the same folder (or
                                  folders) that contain the user's text files.

                                  i) Proper nouns
                                     Note that the option not to count proper nouns as familiar words
                                  when calculating token coverage was included in version 1.0 but is not
                                  included in version 2.0. This means that proper nouns are always
                                  considered "familiar items" by TextLadder. I may add this option back
                                  again, depending on the demand for it.
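
                                  The frequency and range information mentioned under option e) above
                                  can be illustrated with a short sketch. It assumes exact word-form
                                  matching instead of TextLadder's word-family matching, and the
                                  function name, texts, and list words are all made up for the
                                  example.

```python
import re
from collections import Counter

def non_list_word_stats(texts, list_words):
    """Return {word: (frequency, range)} for words not in `list_words`.

    frequency = total occurrences across all texts
    range     = number of texts in which the word occurs
    """
    frequency = Counter()
    text_range = Counter()
    for text in texts:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in list_words]
        frequency.update(tokens)
        text_range.update(set(tokens))   # each text counts at most once per word
    return {word: (frequency[word], text_range[word]) for word in frequency}

texts = [
    "The magma cooled. Magma forms basalt.",
    "Basalt is a rock formed from magma.",
    "The quartz vein, the quartz crystal, the quartz sample.",
]
list_words = {"the", "is", "a", "from", "cooled", "forms", "formed",
              "rock", "vein", "crystal", "sample"}

stats = non_list_word_stats(texts, list_words)
for word, (freq, rng) in sorted(stats.items()):
    print(word, freq, rng)
```

                                  "magma" and "quartz" have the same frequency here, but "quartz"
                                  occurs in only one text, which is the kind of word the range figure
                                  helps a user leave out of a domain-specific list.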


                                  Old Features from Version 1.0

                                  Rule-based parsing engine
                                  TextLadder still uses the same rule-based parsing engine to match
                                  words not on the selected word lists to words from the same word
                                  families that ARE on the word lists. For example, if "tax" is included in
                                  the word list, TextLadder's parsing engine will be able to match "taxes",
                                  "taxation", and "untaxed" to it, even though these three words are not on
                                  the word list. Of course, there are many exceptions in the English
                                  prefixing and suffixing system, and TextLadder has been trained on the
                                  Brown corpus with Nation's 2,000 high frequency word list and
                                  Coxhead's AWL in order to learn to detect these exceptions. However,
                                  its accuracy rate is not 100%, particularly for word families
                                  outside these 3 lists. Still, from what I have seen so far its accuracy
                                  is pretty good, perhaps because many of the most "problematic" items
                                  (in terms of orthographic irregularities) are from (or bear a resemblance
                                  to) high-frequency Anglo-Saxon word families of the kind found in
                                  Nation's list.
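
                                  The "tax" example above can be reproduced with a very small
                                  affix-stripping sketch. TextLadder's actual rules and exception
                                  handling are not documented here, so the prefix and suffix tables and
                                  the function name below are assumptions chosen only to cover the
                                  example.

```python
# Hypothetical, deliberately tiny affix tables.
PREFIXES = ("un", "re", "non")
SUFFIXES = ("ation", "es", "ed", "ing", "s")

def find_headword(word, headwords):
    """Return the headword that `word` reduces to, or None if no rule matches."""
    word = word.lower()
    candidates = [word]
    # Also try the word with one known prefix removed.
    candidates += [word[len(p):] for p in PREFIXES if word.startswith(p)]
    for candidate in candidates:
        if candidate in headwords:
            return candidate
        for suffix in SUFFIXES:
            if candidate.endswith(suffix):
                stem = candidate[: -len(suffix)]
                if stem in headwords:
                    return stem
    return None

headwords = {"tax"}
for form in ("taxes", "taxation", "untaxed", "taxi"):
    print(form, "->", find_headword(form, headwords))
```

                                  A stripper this naive handles only regular forms, which is why the
                                  engine described above also needs to deal with exceptions.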

                                  Proper noun detection
                                  Built into the above engine is the ability to distinguish between words
                                  in a text that are truly proper nouns and words which only appear to be
                                  proper nouns (e.g. because they're at the start of the sentence). As with
                                  the suffix and prefix-based predictions, accuracy is not 100%.
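
                                  One crude way to separate true proper nouns from words that are
                                  capitalized only because they start a sentence is sketched below.
                                  The heuristic (accept a capitalized word only if it is seen
                                  capitalized somewhere other than at the start of a sentence) is an
                                  assumption for illustration, not a description of TextLadder's
                                  method, and it will misjudge cases such as the pronoun "I".

```python
import re

def proper_nouns(text):
    """Return capitalized words seen at least once in a non-sentence-initial position."""
    found = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        for word in words[1:]:            # skip the sentence-initial word
            if word[0].isupper():
                found.add(word)
    return found

text = "Prices rose in Canada. Canada raised taxes. Taxes fell later."
print(proper_nouns(text))
```

                                  "Canada" is accepted because it also appears mid-sentence, while
                                  "Prices" and "Taxes" are capitalized only at the start of a sentence
                                  and are rejected.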

                                  Production of output
                                  The same pieces of output are generated: Screening output, Sequence
                                  List output, Included Low-Frequency Words output, and Final Pool
                                  Contents output. (See the article for details on these.) The one
                                  optional addition is the list of high-frequency non-list words
                                  (described in the "User-changeable default values" section above).


                                  Word Lists

                                  Description
                                  The following lists are modified versions of Paul Nation's thousand
                                  most frequent word families of English list and second thousand most
                                  frequent word families of English list, and Averil Coxhead's Academic
                                  Word List. The lists have been modified for use with TextLadder in the
                                  following way: they list the word family headwords and, under each
                                  headword, include only particular word family items that are
                                  orthographically irregular or semantically opaque. The orthographically
                                  irregular items are listed because the TextLadder parsing engine can
                                  only successfully predict orthographically regular items, and the
                                  semantically opaque items are listed so that TextLadder can be
                                  SOWFI-sensitive.

                                  Note to users: nobody should confuse these lists with the original
                                  lists. (Nation's lists are viewable online at
                                  www1.harenet.ne.jp/~waring/Wordlists/vocfreq.html;
                                  Coxhead's list is viewable at www.vuw.ac.nz/lals/div1/awl/.)
                                  The lists available below have been modified for use with
                                  TextLadder or TextLadder-like parsing engines, and are meant to be
                                  used as such. If you have questions regarding these modified lists
                                  (e.g. questions about asterisk placement, why certain items are
                                  considered orthographically irregular, etc.), please DO NOT address
                                  them to the authors of the original lists.

                                  The SOWFI classification system
                                  The below lists also act as a model to users attempting to put together
                                  their own SOWFI-ready lists for use with TextLadder. For those of you
                                  who fall into this category, or who are simply curious about what all
                                  the asterisks in the list mean, I'm going to take a quick moment to
                                  explain the SOWFI classification system.

                                  When two or more words within a word family have the same number
                                  of asterisks, this means that they are "linked SOWFI's." For example,
                                  the word family "determine" is listed as follows:

                                  determine
                                         determinate *
                                         indeterminate *

                                  Both "determinate" and "indeterminate" are considered linked SOWFI's;
                                  i.e. having encountered one means one should be able to guess the
                                  meaning of the other.

                                  When words within a word family have different numbers of asterisks,
                                  they are separate SOWFI's. For example, the word family "technical" is
                                  listed as follows:

                                  technical
                                         technically *
                                         technicality **

                                  "Technically" and "technicality" are considered separate SOWFI's; i.e.
                                  they will each appear on their own in the Pre-Reading list (under the
                                  section reserved for SOWFI's from previously encountered word
                                  families, called "Old Words with New Meanings").
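
                                  For users building their own lists, the asterisk convention can be
                                  read mechanically, as in the sketch below. It assumes the file layout
                                  matches the listings shown above (unindented headword, indented items,
                                  optional asterisks after a space); the real list files may differ, and
                                  the function name is hypothetical. Items with the same asterisk count
                                  are grouped together as linked SOWFI's, and items without asterisks
                                  are placed in group 0.

```python
from collections import defaultdict

SAMPLE = """\
determine
       determinate *
       indeterminate *
technical
       technically *
       technicality **
"""

def parse_word_list(text):
    """Parse headword/indented-item lines into {headword: {asterisk_count: [items]}}."""
    families = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():              # an unindented line starts a new family
            current = line.strip()
            families[current] = defaultdict(list)
        elif current is not None:
            parts = line.split()
            item = parts[0]
            has_stars = len(parts) > 1 and set(parts[1]) == {"*"}
            stars = len(parts[1]) if has_stars else 0
            families[current][stars].append(item)
    return {head: dict(groups) for head, groups in families.items()}

parsed = parse_word_list(SAMPLE)
for head, groups in parsed.items():
    print(head, groups)
```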

                                  A personal note about the process of identifying SOWFI's:
                                  I cannot overemphasize how subjective the process of identifying
                                  SOWFI's is. There will definitely be items on the below lists
                                  identified as SOWFI's that you think shouldn't be, and perhaps some
                                  items not identified as SOWFI's that you think should be. In the end
                                  I decided to be very liberal in identifying SOWFI's, placing asterisks
                                  next to almost anything with the potential for semantic opacity. I felt
                                  that there was less harm in overidentifying than underidentifying, since
                                  if the item isn't semantically opaque, the reader will simply skip over it
                                  during the Pre-Reading section. Still, please feel free to remove
                                  asterisks that you feel are unwarranted from your copy of these lists.

                                  Download
                                  You can download the Lists just below. Included with the three other
                                  lists in the zip file is a very rough copy of an Economics domain-
                                  specific list I've generated on the basis of information from 261
                                  Economics-related news texts. I'm supplying it here so that TextLadder
                                  users have at least one domain (Economics) for which the lists in
                                  this package provide 95% coverage (in conjunction with proper nouns).

                                  Download lists


                                  (Feb. 22, 2002: A fifth list, consisting of high-frequency word families
                                  found in Voice of America simplified news texts, is available here
                                  and is useful for anyone looking to get 95% coverage of Voice of
                                  America Special English texts.)



                                  Important Notes

                                  1. Processing Time
                                  Once you choose the texts you want to process and the coverage lists
                                  you'd like TextLadder to use, TextLadder will load some files and then
                                  begin the screening process. Once the screening process is finished,
                                  TextLadder will tell you how many texts passed the screening and ask
                                  you if you want to continue. If you click "Yes", TextLadder will begin
                                  the actual sorting, which takes a fairly long time: anywhere from 15
                                  minutes to 2 hours, depending on the number of texts and the speed of
                                  your computer. (Note that I'm referring to the "Build sequence list"
                                  option here, not the "Generate list of high-frequency non-list words"
                                  option, which goes a lot faster.)

                                  2. Windows XP and Windows 2000 vs. Windows 95, 98 and Me
                                  In versions of TextLadder prior to 2.10 (i.e. versions 2.00 through
                                  2.09) TextLadder ran 5 to 7 times slower on Windows 98/Me than
                                  on Windows 2000/XP. However, with versions 2.10 and later there
                                  is only a small difference between TextLadder's performance
                                  on Windows 2000/XP and its performance on Windows 98/Me
                                  (TextLadder runs approximately 1.5 times slower on 98/Me,
                                  all other things being equal).

                                  3. Windows XP display bug
                                  Windows XP users may find that TextLadder's display window
                                  freezes at some point. However, TextLadder is in fact running
                                  perfectly: it is simply unable to refresh the display window. To
                                  confirm that TextLadder is in fact still running, go into the TextLadder
                                  program folder, then into the Output subfolder, and click on the file
                                  "LastUpdated.txt". This file is updated by TextLadder at regular
                                  intervals: if the time listed in this file is no more than 5 to 10
                                  minutes old, then you know that TextLadder is running OK.
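
                                  If you would rather not open the file by hand, the check in this note
                                  can be approximated with a short script. This is only a sketch: it
                                  uses the file's modification time as a stand-in for the time written
                                  inside LastUpdated.txt (whose exact format is not assumed here), and
                                  the function name and the 10-minute default are my own choices.

```python
import time
from pathlib import Path


def is_still_running(output_dir, max_age_minutes=10, now=None):
    """Return True if LastUpdated.txt was modified recently enough.

    output_dir is the Output subfolder of the TextLadder program folder.
    The file's modification time is used as a proxy for its contents.
    """
    status_file = Path(output_dir) / "LastUpdated.txt"
    if not status_file.exists():
        return False
    now = time.time() if now is None else now
    age_seconds = now - status_file.stat().st_mtime
    return age_seconds <= max_age_minutes * 60
```

                                  For example, is_still_running(r"C:\TextLadder\Output") would
                                  return True while the file is still being refreshed; the path shown
                                  is a placeholder for wherever TextLadder is installed.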

                                  4. Windows XP part-of-speech tagging
                                  Please note that the version of Brill that was compiled for Windows/
                                  DOS using djgpp does not work in Windows XP. Instead, Windows XP
                                  users must download the version of Brill that was compiled for
                                  Windows using MinGW32. Both versions of Brill are available for
                                  download here.

                                  5. The content of ScreeningOutput.txt
                                  When TextLadder tells you how many texts passed the screening and
                                  asks you if you want to continue, you can check the file
                                  ScreeningOutput.txt for more details about the results of the
                                  screening process (e.g. the exact coverage figures for each text).
                                  In versions of TextLadder prior to 2.15, the content of ScreeningOutput.txt
                                  changed over the course of the processing (first giving coverage figures
                                  vis-a-vis the original word lists, then giving coverage figures
                                  vis-a-vis the trimmed word list). In version 2.15 and subsequent versions,
                                  ScreeningOutput.txt (and its co-product, ScreeningOutputNotPassed.txt)
                                  gives coverage figures only vis-a-vis the original word lists.

                                  6. Length and number of texts processed by TextLadder
                                      It is recommended that you use texts not much longer than 3000
                                  words if you want to avoid having texts excluded during the screening
                                  process for being too long. Also, note that you can have as many as
                                  750 texts for processing.
                                      For the "compile a list of high-frequency non-list words" option, texts
                                  can be as long as 2,250,000 words. However, certain restrictions apply:
                                  the combined length of all the texts cannot exceed this 2,250,000-word limit.
                                  Also, it is better to have either a few very large texts or many small
                                  texts, but not both together.
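
                                  Before starting a long run, it may help to check a collection against
                                  the figures in this note. The sketch below is an assumption-laden
                                  approximation: it counts words by splitting on whitespace, which may
                                  not match TextLadder's own word counting, and the function names are
                                  hypothetical. The 3000-word and 750-text figures apply to sequence
                                  building; the 2,250,000-word total applies to the high-frequency
                                  non-list words option.

```python
def count_words(text):
    """Approximate word count: whitespace-separated tokens (an assumption)."""
    return len(text.split())


def check_collection(texts, per_text_limit=3000, max_texts=750,
                     total_limit=2_250_000):
    """Compare a {filename: text} mapping against the limits in this note."""
    counts = {name: count_words(body) for name, body in texts.items()}
    total = sum(counts.values())
    return {
        # Texts likely to be screened out for length when building a sequence.
        "too_long": sorted(n for n, c in counts.items() if c > per_text_limit),
        # More texts than can be processed at once.
        "too_many_texts": len(counts) > max_texts,
        "total_words": total,
        # Combined size too large for the high-frequency word option.
        "over_total_limit": total > total_limit,
    }


report = check_collection({"a.txt": "word " * 3001, "b.txt": "two words"})
print(report["too_long"], report["total_words"])
# ['a.txt'] 3003
```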

                                  7. Mid-sentence line-breaks
                                  If you plan to use the part-of-speech tagging option, make sure that the
                                  texts you are using do not contain mid-sentence line-breaks. Mid-
                                  sentence line breaks usually occur when a text is copied-and-pasted,
                                  e.g. off the Internet. To avoid this (if you are using texts from the web),
                                  save the html page in text format, rather than copying-and-pasting.
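
                                  If a text has already been copied-and-pasted, the breaks can often be
                                  repaired afterwards. The following is a heuristic sketch, not part of
                                  TextLadder: it assumes that blank lines separate paragraphs and that a
                                  line not ending in ".", "!" or "?" (optionally followed by a closing
                                  quote or bracket) was broken mid-sentence. Headings without final
                                  punctuation will be merged into the next line, so check the result.

```python
import re

# A line "ends a sentence" if it finishes with . ! or ?, optionally
# followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"[.!?][\"')\]]*$")


def join_broken_lines(text):
    """Rejoin lines that were split in the middle of a sentence."""
    fixed_paragraphs = []
    for paragraph in re.split(r"\n\s*\n", text):
        joined = []
        for line in (ln.strip() for ln in paragraph.splitlines()):
            if not line:
                continue
            if joined and not SENTENCE_END.search(joined[-1]):
                joined[-1] += " " + line  # previous line was cut mid-sentence
            else:
                joined.append(line)
        if joined:
            fixed_paragraphs.append("\n".join(joined))
    return "\n\n".join(fixed_paragraphs)


sample = "The market fell\nsharply today.\nPrices rose.\n\nNew paragraph"
print(join_broken_lines(sample))
# The market fell sharply today.
# Prices rose.
#
# New paragraph
```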

                                  8. Preparing texts for processing
                                  A final cautionary note: before processing, always remember to remove
                                  from your texts anything you do not want included.
                                  Otherwise, TextLadder will pick it up and include it as part of the
                                  analysis.

                                  9. Recommended number of texts (for pedagogical purposes)
                                  Although this is a somewhat arbitrary number, I would recommend 200
                                  as the minimum number of texts you should aim to have pass the cut-
                                  off, if you are using TextLadder to compile a sequence list/pre-reading
                                  list/list of included low-frequency words for pedagogical use. Of course,
                                  the output sequence list will include far fewer than 200
                                  texts, but having more than 200 texts for TextLadder to choose from will
                                  improve the quality of your sequence list. In practice, to get 200 texts to
                                  pass the cut-off, you should be looking to collect between 200 and 300
                                  texts for TextLadder to analyze.








                               Template components provided by WEBalley
                               "Black Chancery" fonts in title from Stuff.uk.com
                               "Books in wheelbarrow" from unknown public domain source.
                                © Sina Ghadirian 2001