WebLadder Project

Online WebLadder Project
(See FAQ for details.)


TextLadder


TextLadder 3.0

TextLadder 3.0 is a new version of the traditional TextLadder program. Details about the earlier TextLadder appear below, but here is a brief description. TextLadder 1.0/2.0 could take a collection of texts in a given domain (e.g. economics, physics, medicine) and arrange them into a special order, such that students reading the texts in this order would learn the new vocabulary incrementally, at a pace they could handle. However, TextLadder could only do this for domains for which domain-specific word lists existed. This meant that it was up to the TextLadder user to find or make these domain-specific lists, a process both time-consuming and difficult.

TextLadder 3.0 completely avoids this problem: in the new TextLadder no lists are necessary. This is because the program is now capable of compiling its own word lists. All that the user is responsible for is providing the collection of texts, all of which must be from the same domain.

There are a number of other relatively new options that appeared only in the later versions of TextLadder 2.0. One is the ability to organize the texts into subgroups, so that texts within subgroups are not separated from each other when the reading order of the texts is constructed. Another option is to have the user decide in advance exactly what the reading order should be, a choice which eliminates TextLadder's ability to control the incremental introduction of new vocabulary, but which still gives the reader some of the "fringe benefits" of TextLadder -- such as the list of new vocabulary presented prior to reading each text. One last new feature which users will find interesting is the ability to link all the words in each text and each pre-reading list to an online dictionary, so that a reader can click on any word they don't know and get an immediate dictionary definition or translation.

Below you'll find the help documentation for the previous version of TextLadder, version 2.0. Relevant to the current version, 3.0, are "Part-of-speech tagging support", under section 2, and "Length and number of texts processed by TextLadder" through "Recommended number of texts (for pedagogical purposes)", under section 5. If you have any questions at all about TextLadder 3.0, please don't hesitate to contact me.




TextLadder 2.0

1. General Remarks
2. Changes from version 1.0
Part-of-speech tagging support
SOWFI support
Different algorithm
More user-changeable defaults
3. Old features from 1.0
Rule-based parsing engine
Proper noun detection
Production of output
4. Word Lists
Description
The SOWFI classification system
Download
5. Important Notes
Processing time
Windows XP and Windows 2000 vs. Windows 95, 98 and Me
Windows XP display bug
Windows XP part-of-speech tagging
The content of ScreeningOutput.txt
Length and number of texts processed by TextLadder
Mid-sentence line breaks
Preparing texts for processing
Recommended number of texts (for pedagogical purposes)


General remarks
This documentation assumes you have read the article online at LLT. If you have not, you should at the very least read this extract from the article and the definitions that follow it before continuing to read on:

"This article addresses the problem of how to bring foreign language students with a limited vocabulary consisting mainly of high-frequency words, to the point where they are able to adequately comprehend authentic texts in a target domain or genre. It proposes bridging the vocabulary gap by determining which word families account for 95% of the target domain's running words, and then having students learn these word families by reading texts in an order that allows for the incremental introduction of target vocabulary. This is made possible by a recently developed computer program that sorts through a collection of texts and 1) finds texts with a suitably high proportion of target words, 2) ensures that over the course of these texts, most or all target words are encountered 5 or more times, and 3) creates an order for reading these texts, such that each new text contains a reasonably small number of new target words and a maximum number of familiar words."

In the following documentation,

Sequence list refers to the list of texts arranged into the order mentioned above

Pre-reading list refers to the list of new words that a student needs to familiarize him/herself with via a dictionary (preferably computer- or Internet-based) prior to reading each text

Included low-frequency words refers to words that are included on the pre-reading list even though they occur fewer than 5 times throughout the texts.
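
The coverage idea in the extract above can be illustrated with a minimal sketch. It is not TextLadder's code: the function names are hypothetical, the tokenizer is deliberately crude, and words are matched by exact form rather than by word family.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into running words (tokens)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def token_coverage(text, familiar_words):
    """Return the fraction of running words found in familiar_words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in familiar_words)
    return covered / len(tokens)

def passes_screening(text, familiar_words, threshold=0.95):
    """True if the text reaches the coverage threshold (95% by default)."""
    return token_coverage(text, familiar_words) >= threshold
```

For example, a six-word text in which five of the running words are familiar has a coverage of about 83% and would not pass a 95% threshold.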


Changes from version 1.0
TextLadder version 1.0 was created in 2000 and is the version described in the article in Language Learning & Technology. Anyone who has read the article should be aware that TextLadder 2.0 differs from TextLadder 1.0 in the following ways:

Part-of-speech tagging support
TextLadder 2.0 supports part-of-speech tagging. This means that if the part-of-speech option in TextLadder is selected, words in the outputted pre-reading list are accompanied by part-of-speech information ("n.", "adj.", etc.). (See the Sample Output section for an example.) At this time, TextLadder only supports the Brown tagset, which is the same tagset used by Brill. Users have two options: they can install Brill on their computer and have TextLadder run Brill for them (TextLadder also formats the texts for Brill prior to tagging, and afterwards renames the tagged files and puts them in their own folder), or they can manually run the tagger and rename the files themselves. (If the second option is chosen, users should contact me about exactly how to go about doing this.)

SOWFI support
TextLadder 2.0 supports SOWFI classification. SOWFI stands for "semantically opaque word family item". It refers to individual items within word families whose meanings are not guessable on the basis of the word family's base meaning, even though some kind of semantic relationship connects them to it. Examples of SOWFI's are "shortly" within the word family "short", "governor" within the word family "govern", "earthy" within the word family "earth", and "namely" within the word family "name". It is important that TextLadder distinguish between SOWFI's and other members of the word family, because one cannot assume that a learner who has encountered another member of the word family will subsequently be able to understand a SOWFI when confronted with it. Therefore, although a SOWFI adds to the overall frequency count of its word family during the TextLadder sorting process, it is given its own appearance in the pre-reading section. (The SOWFI classification system itself is described in more detail below.)
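
The two-part treatment of a SOWFI (counted with its family, but listed on its own) can be sketched as follows. The FORMS table and the function name are hypothetical illustrations, not TextLadder's internal data structures.

```python
from collections import Counter

# Hypothetical mini word list: each form maps to (family headword, is_sowfi).
FORMS = {
    "short": ("short", False),
    "shorter": ("short", False),
    "shortly": ("short", True),  # SOWFI: meaning not guessable from "short"
}

def count_families(tokens, forms=FORMS):
    """Count family frequencies and collect SOWFIs needing their own entry."""
    family_counts = Counter()
    sowfis_seen = set()
    for token in tokens:
        if token not in forms:
            continue
        family, is_sowfi = forms[token]
        family_counts[family] += 1   # a SOWFI still counts toward its family
        if is_sowfi:
            sowfis_seen.add(token)   # but is listed separately for pre-reading
    return family_counts, sowfis_seen
```

Here "shortly" raises the count for the family "short" like any other member, while also being collected for a separate pre-reading entry.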

Different algorithm
TextLadder 2.0 uses a different algorithm from 1.0 when arranging texts into a sequence list. This algorithm involves first trimming the word lists to fit the particular corpus of texts being processed (note that a separate trimmed list is created: the old lists remain intact). The trimmed list still provides coverage that is above the reading comprehension threshold (set at 95% by default), but contains fewer low-frequency words. This means that sequence lists are shorter AND that there are fewer low-frequency words included in the pre-reading list than in version 1.0. Note that the actual number of low-frequency words in the pre-reading lists can be decreased even further by changing one of TextLadder's default values, but there will be a corresponding increase in the number of texts on the sequence list. In other words, there is a trade-off between the number of low-frequency words in the pre-readings and the length of the sequence list.
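
One way to picture the trimming step is the sketch below. It is an assumption about how such trimming could work, not a description of TextLadder's actual procedure: words are kept in order of corpus frequency until the threshold is reached, word families and proper nouns are ignored, and the function name is hypothetical.

```python
from collections import Counter

def trim_word_list(corpus_tokens, word_list, threshold=0.95):
    """Return a new, smaller list that still reaches the coverage threshold.

    Words are kept in order of corpus frequency until the kept words cover
    at least `threshold` of the running words. The original list is not
    modified. If the full list cannot reach the threshold, every list word
    that occurs in the corpus is kept.
    """
    total = len(corpus_tokens)
    counts = Counter(t for t in corpus_tokens if t in word_list)
    trimmed, covered = [], 0
    for word, freq in counts.most_common():
        if total and covered / total >= threshold:
            break
        trimmed.append(word)
        covered += freq
    return trimmed
```

In a 100-word corpus where one list word accounts for 90 tokens and another for 6, only those two words are kept: together they already cover 96% of the running words, so the rarer list words are trimmed.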

User-changeable default values
There has been a change in the number and kind of default values changeable by the user in TextLadder 2.0.

a) Minimum number of encounters
In version 1.0, you could change the minimum number of encounters required from 5 to 6; in version 2.0 you can change it to any value you want.

b) Maximum number of new words per text
In version 1.0, the maximum number of new words per text was set permanently at 25. In version 2.0, this value can be changed by the user to any value desired. As with version 1.0, however, there will be certain cases (especially among the first few texts) where TextLadder will be unable to respect the maximum limit.

c) Theoretical reading comprehension coverage threshold
In version 1.0, this threshold was set permanently at 95%. In version 2.0, the threshold can be changed to anything between 1 and 100%. However, users should be aware that they should not change this value simply because their lists account for more than 95% of the texts being processed: TextLadder will detect that automatically and take it into account. Users should only change this value if they believe the 95% value itself is theoretically unsound.

d) Lowest number of repetitions of low frequency words
This value relates to the algorithm used by TextLadder 2.0 to create the sequence list. Once all high frequency words in the corpus have been encountered among texts already on the sequence list, TextLadder switches gears and begins looking for texts with the highest number of repetitions of previously-encountered low-frequency words. The "lowest number of repetitions of low frequency words" value tells TextLadder when to stop looking for these repetitions. E.g. if the value is set at 20 (the default value), TextLadder will stop the process of adding texts to the sequence list once the number of repetitions of low-frequency words per text drops below 20.

e) Build a sequence list vs. Create list of high-frequency non-list words
By default this value is set to "Build a sequence list" in TextLadder 2.0. However, the user can select the other option, which tells TextLadder not to build a sequence list but instead to look for the highest frequency NON-list words. Such a list could be useful to a user wanting to put together a "rough and ready" domain-specific list. For example, if a user wanted to put together a very rough geology-related domain-specific list, s/he could collect a group of academic texts related to geology, then open TextLadder, select this option, choose Nation's 2,000-most-frequent-words-list and Coxhead's Academic Word List as the coverage lists, and run TextLadder on the texts. The user could then see what high-frequency non-list words emerge, and use them as the basis for a very rough domain-specific word list. Note that TextLadder provides not just frequency but also range information about these non-list words, so the user could avoid including in his/her domain-specific list a word that occurred 15 times but in only one text.

f) Include non-list words and very low frequency words in pre-reading list
By default this value is set to "No", meaning that only list words will appear in the pre-reading list (not necessarily all the list words in each text, but rather only as many as is necessary to reach the reading comprehension threshold, set at 95% token coverage by default). Set this value to "Yes" if you want EVERY unfamiliar word in each text listed in the pre-reading list.

g) Processing Options: "Standard Processing" vs. "SubGroups Restriction Processing" vs. "Pre-Set Order Processing"

* Standard processing: TextLadder arranges texts into the best possible order for vocabulary introduction, keeping the total number of texts to be read as short as possible.

* Subgroups restriction: TextLadder accepts a restriction on its processing: it does not separate texts within a subgroup. Within the subgroups option, there are various sub-options the user can choose among:

- The user has the option either to allow TextLadder to arrange the order of the subgroups themselves as it sees fit, or to arrange the subgroups according to a pre-set order.

- The user has the option of specifying how many texts in the subgroups should appear in the final output. Thus, a user could, for example, input subgroups with 10 or more texts each, but specify that only 4 texts from each subgroup should appear on the final sequence list. This would mean that TextLadder would choose the best 4 texts from each subgroup.

(As a variant on this last option, the user would be able to specify the number of texts from a subgroup that should appear on the final sequence list for each individual subgroup. Thus, while the number of texts specified for one subgroup might be 4, the number specified for another might be 7, and so on.)

- The user has the option to either have TextLadder eliminate all texts among the subgroups that do not pass the reading comprehension threshold (which is one of the defaults that may be changed, though originally set at 95% -- see above), or not to eliminate these texts. During standard processing, texts that do not pass the threshold are automatically eliminated, but a user who wants subgroup texts to be included even when they don't pass the threshold now has that option.

* Pre-set order option: TextLadder will do no arranging whatsoever but will leave the texts in the exact order entered by the user. Within the pre-set order option:

- As with the subgroups, the user has the option either to have TextLadder eliminate texts which do not pass the reading comprehension threshold or not.

h) Option to "Create HTML versions of text files and pre-reading list"

The user has the option to have TextLadder create HTML versions of the text files being processed as well as the outputted pre-reading list. The words in these HTML versions will be linkable to an online dictionary website. Students reading the texts or pre-reading lists will therefore be able to click on words they don't understand and get an instant dictionary definition or translation.

NOTE: TextLadder does not supply hyperlink references for any website dictionaries. It is up to the user to research such websites and supply the reference in the box TextLadder provides. Also, it is the user's responsibility (where necessary) to contact the website and ask permission to link to it.

The outputted HTML pre-reading list will appear in the TextLadder program subfolder with the rest of the output files. The HTML versions of the text files will appear in a subfolder called "TextLadderHTML", which will be located in the same folder (or folders) that contain the user's text files.

i) Proper nouns
Note that the option to not include proper nouns as familiar words when calculating token coverage was included in version 1.0 but is not included in version 2.0. This means that proper nouns are always considered "familiar items" by TextLadder. I may add this option back again, depending on the demand for it.
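
To make the idea of a sequence list and the "maximum number of new words per text" default more concrete, here is a deliberately simplified greedy ordering. It is only a sketch under stated assumptions: the function name and data layout are hypothetical, and it ignores the minimum number of encounters, the stopping rule for low-frequency words, and everything else TextLadder's real algorithm takes into account.

```python
def build_sequence(texts, max_new_words=25):
    """Greedy ordering: always read next the text with the fewest new words.

    `texts` maps a text name to the set of target words it contains.
    Returns a list of (name, new_words, over_limit) tuples in reading order.
    """
    remaining = dict(texts)
    known = set()
    sequence = []
    while remaining:
        # Pick the text introducing the fewest unfamiliar words; ties go by name.
        name = min(remaining, key=lambda n: (len(remaining[n] - known), n))
        new_words = remaining.pop(name) - known
        over_limit = len(new_words) > max_new_words
        sequence.append((name, sorted(new_words), over_limit))
        known |= new_words
    return sequence
```

Each tuple's list of new words plays the role of a pre-reading list for that text. The over_limit flag marks texts that exceed the maximum, which mirrors the remark under (b) that the limit cannot always be respected.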


Old Features from Version 1.0

Rule-based parsing engine
TextLadder still uses the same rule-based parsing engine to match words not on the selected word lists to words from the same word families that ARE on the word lists. For example, if "tax" is included in the word list, TextLadder's parsing engine will be able to match "taxes", "taxation", and "untaxed" to it, even though these three words are not on the word list. Of course, there are many exceptions in the English prefixing and suffixing system, and TextLadder has been trained on the Brown corpus with Nation's 2,000 high frequency word list and Coxhead's AWL in order to learn to detect these exceptions. However, its accuracy rate is not 100%, particularly for word families outside these 3 lists. Still, from what I have seen so far its accuracy is pretty good, perhaps because many of the most "problematic" items (in terms of orthographic irregularities) are from (or bear a resemblance to) high-frequency Anglo-Saxon word families of the kind found in Nation's list.
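
The general idea of matching regular derived forms to a headword can be sketched with simple affix stripping. The prefix and suffix lists below are small hypothetical samples, not TextLadder's actual rules, and the sketch makes no attempt to handle the exceptions mentioned above.

```python
PREFIXES = ["", "un", "re", "non"]
SUFFIXES = ["", "s", "es", "ed", "ing", "ation", "able"]

def find_headword(word, headwords):
    """Return the headword `word` belongs to, or None if no rule matches."""
    word = word.lower()
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        stem = word[len(prefix):]
        for suffix in SUFFIXES:
            if suffix and not stem.endswith(suffix):
                continue
            # Strip the suffix (the empty suffix leaves the stem unchanged).
            base = stem[:len(stem) - len(suffix)]
            if base in headwords:
                return base
    return None
```

With "tax" as the only headword, this maps "taxes", "taxation", and "untaxed" to "tax", and returns None for an unrelated word.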

Proper noun detection
Built into the above engine is the ability to distinguish between words in a text that are truly proper nouns and words which only appear to be proper nouns (e.g. because they're at the start of the sentence). As with the suffix and prefix-based predictions, accuracy is not 100%.
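
One simple heuristic for this kind of distinction is sketched below. It is an assumption for illustration only, not TextLadder's method: a word is treated as a proper noun if it appears capitalized somewhere other than the start of a sentence and never appears in lowercase. It will misclassify some words (the pronoun "I", for instance).

```python
import re

def find_proper_nouns(text):
    """Return words that occur capitalized mid-sentence and never in lowercase."""
    lowercase_seen = set()
    mid_sentence_caps = set()
    # Split after sentence-final punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        for position, word in enumerate(words):
            if word[0].isupper():
                if position > 0:
                    mid_sentence_caps.add(word)
            else:
                lowercase_seen.add(word.lower())
    return {w for w in mid_sentence_caps if w.lower() not in lowercase_seen}
```

A word such as "Taxes" that is capitalized only because it opens a sentence, and that also appears elsewhere as "taxes", is not reported.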

Production of output
The same pieces of output are generated: Screening output, Sequence List output, Included Low-Frequency Words output, and Final Pool Contents output. (See the article for details on these.) The one optional addition is the list of high-frequency non-list words (described in the "User-changeable default values" section above).
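
The frequency and range information reported for non-list words can be illustrated with a short sketch. The function name and the input layout (one token list per text) are hypothetical; this is not how TextLadder itself produces the list.

```python
from collections import Counter

def non_list_words(texts, list_words, top_n=10):
    """Return (word, frequency, range) for the most frequent non-list words.

    `texts` is a list of token lists; range is the number of texts in which
    a word occurs at least once.
    """
    frequency = Counter()
    text_range = Counter()
    for tokens in texts:
        off_list = [t for t in tokens if t not in list_words]
        frequency.update(off_list)
        text_range.update(set(off_list))  # count each text once per word
    return [(w, f, text_range[w]) for w, f in frequency.most_common(top_n)]
```

A word with high frequency but a range of 1 occurs in only one text, which is the kind of item a user might leave out of a rough domain-specific list.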

Word Lists

Description
The following lists are modified versions of Paul Nation's lists of the first and second thousand most frequent word families of English, and Averil Coxhead's Academic Word List. The lists have been modified for use with TextLadder in the following way: they list the word family headwords and, under each headword, include only particular word family items that are orthographically irregular or semantically opaque. The orthographically irregular items are listed because the TextLadder parsing engine can only successfully predict orthographically regular items, and the semantically opaque items are listed so that TextLadder can be SOWFI-sensitive.

Note to users: nobody should confuse these lists with the original lists (Nation's lists are viewable online at the following address: www1.harenet.ne.jp/~waring/Wordlists/vocfreq.html; Coxhead's list is viewable at www.vuw.ac.nz/lals/div1/awl/). The lists available below have been modified for use with TextLadder or TextLadder-like parsing engines, and are meant to be used as such. If you have questions regarding these modified lists (e.g. questions about asterisk placement, why certain items are considered orthographically irregular, etc.), please DO NOT address them to the authors of the original lists.

The SOWFI classification system
The below lists also act as a model to users attempting to put together their own SOWFI-ready lists for use with TextLadder. For those of you who fall into this category, or who are simply curious about what all the asterisks in the list mean, I'm going to take a quick moment to explain the SOWFI classification system. When two or more words within a word family have the same number of asterisks, this means that they are "linked SOWFI's." For example, the word family "determine" is listed as follows:

determine
    determinate *
    indeterminate *

Both "determinate" and "indeterminate" are considered linked SOWFI's; i.e. having encountered one means one should be able to guess the meaning of the other. When words within a word family have different numbers of asterisks, they are separate SOWFI's. For example, the word family "technical" is listed as follows:

technical
    technically *
    technicality **

"Technically" and "technicality" are considered separate SOWFI's; i.e. they will each appear on their own in the Pre-Reading list (under the section reserved for SOWFI's from previously encountered word families, called "Old Words with New Meanings"). A personal note about the process of identifying SOWFI's: I cannot overemphasize how subjective the process of identifying SOWFI's is. There will definitely be items on the below lists identified as SOWFI's that you think shouldn't be, and perhaps some items not identified as SOWFI's that you think should be. In the end I decided to be very liberal in identifying SOWFI's, placing asterisks next to almost anything with the potential for semantic opacity. I felt that there was less harm in overidentifying than underidentifying, since if the item isn't semantically opaque, the reader will simply skip over it during the Pre-Reading section. Still, please feel free to remove asterisks that you feel are unwarranted from your copy of these lists.

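For users building their own SOWFI-ready lists, the asterisk notation above can be read mechanically. The sketch below assumes a layout inferred only from the two examples shown (unindented headword, indented items, optional asterisks after an item); the real list files may differ, and both function names are hypothetical.

```python
def parse_word_list(lines):
    """Parse entries into {headword: {item: asterisk_count}}.

    Unindented lines are headwords; indented lines are family items, with
    an optional run of asterisks marking a SOWFI group (0 = not a SOWFI).
    """
    families = {}
    current = None
    for line in lines:
        if not line.strip():
            continue
        if not line[0].isspace():
            current = line.strip()
            families[current] = {}
        elif current is not None:
            parts = line.split()
            stars = parts[1].count("*") if len(parts) > 1 else 0
            families[current][parts[0]] = stars
    return families

def sowfi_groups(items):
    """Group SOWFIs by asterisk count; items sharing a count are linked."""
    groups = {}
    for item, stars in items.items():
        if stars:
            groups.setdefault(stars, []).append(item)
    return list(groups.values())
```

Applied to the two entries above, "determinate" and "indeterminate" fall into one linked group, while "technically" and "technicality" each form a group of their own.
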
Download
You can download the lists just below. Included with the three other lists in the zip file is a very rough copy of an Economics domain-specific list I've generated on the basis of information from 261 Economics-related news texts. I'm supplying it here so that TextLadder users have at least one domain (Economics) for which the lists in this package provide 95% coverage (in conjunction with proper nouns).

Download lists

(Feb. 22, 2002: A fifth list, consisting of high-frequency word families found in Voice of America simplified news texts, is available here and is useful for anyone looking to get 95% coverage of Voice of America Special English texts.)


Important Notes

1. Processing Time
Once you choose the texts you want to process and the coverage lists you'd like TextLadder to use, TextLadder will load some files and then begin the screening process. Once the screening process is finished, TextLadder will tell you how many texts passed the screening and ask you if you want to continue. If you click "Yes", TextLadder will begin the actual sorting, which takes a fairly long time - anywhere from 15 minutes to 2 hours, depending on the number of texts and the speed of your computer. (Note that I'm referring to the "Build a sequence list" option here, not the "Create list of high-frequency non-list words" option, which goes a lot faster.)

2. Windows XP and Windows 2000 vs. Windows 95, 98 and Me
In versions of TextLadder prior to 2.10 (i.e. versions 2.00 through 2.09) TextLadder ran 5 to 7 times slower on Windows 98/Me than on Windows 2000/XP. However, with versions 2.10 and later there is only a small difference between TextLadder's performance on Windows 2000/XP and its performance on Windows 98/Me (TextLadder runs approximately 1.5 times slower on 98/Me, all other things being equal).

3. Windows XP display bug
Windows XP users may find that TextLadder's display window freezes at some point. However, TextLadder is in fact running perfectly: it is simply unable to refresh the display window. To confirm that TextLadder is still running, go into the TextLadder program folder, then into the Output subfolder, and open the file "LastUpdated.txt". This file is updated by TextLadder at regular intervals: if the time listed in this file is no more than 5 to 10 minutes old, then you know that TextLadder is running OK.
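
The same check can be scripted. The sketch below makes an assumption worth stating plainly: it looks at the file's modification time rather than the time written inside the file, since the file's internal format is not described here. The function name is hypothetical, and the path would be the "LastUpdated.txt" file in the Output subfolder of wherever TextLadder is installed.

```python
import os
import time

def is_still_running(last_updated_path, max_age_minutes=10):
    """True if the file was modified within the last `max_age_minutes`."""
    age_seconds = time.time() - os.path.getmtime(last_updated_path)
    return age_seconds <= max_age_minutes * 60
```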

4. Windows XP part-of-speech tagging
Please note that the version of Brill that was compiled for Windows/DOS using djgpp does not work in Windows XP. Instead, Windows XP users must download the version of Brill that was compiled for Windows using MinGW32. Both versions of Brill are available for download here.

5. The content of ScreeningOutput.txt
When TextLadder tells you how many texts passed the screening and asks you if you want to continue, you can check the file ScreeningOutput.txt for more details about the results of the screening process (e.g. the exact coverage figures for each text). In versions of TextLadder prior to 2.15, the content of ScreeningOutput.txt changed over the course of the processing (first giving coverage figures vis-a-vis the original word lists, then giving coverage figures vis-a-vis the trimmed word list). In version 2.15 and subsequent versions, ScreeningOutput.txt (and its co-product, ScreeningOutputNotPassed.txt) gives coverage figures only vis-a-vis the original word lists.

6. Length and number of texts processed by TextLadder
It is recommended that you use texts not much longer than 3000 words if you want to avoid having texts excluded during the screening process for being too long. Also, note that you can have as many as 750 texts for processing. For the "Create list of high-frequency non-list words" option, texts can be as long as 2,250,000 words. However, certain restrictions apply: the combined length of all the texts cannot exceed this 2,250,000-word limit. Also, it is better to have either a few very large texts or many small texts, but not both together.
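
A collection can be checked against these figures before it is handed to TextLadder. The sketch below is a hypothetical helper, not part of the program, and it treats 3000 words as a soft limit because the exact cut-off used during screening is not stated.

```python
MAX_TEXTS = 750               # maximum number of texts, per the note above
RECOMMENDED_MAX_WORDS = 3000  # texts much longer than this may be screened out

def check_corpus(word_counts):
    """Return a list of warnings for a {text name: word count} mapping."""
    warnings = []
    if len(word_counts) > MAX_TEXTS:
        warnings.append(f"{len(word_counts)} texts exceeds the limit of {MAX_TEXTS}")
    for name, count in sorted(word_counts.items()):
        if count > RECOMMENDED_MAX_WORDS:
            warnings.append(f"{name}: {count} words (over {RECOMMENDED_MAX_WORDS})")
    return warnings
```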

7. Mid-sentence line-breaks
If you plan to use the part-of-speech tagging option, make sure that the texts you are using do not contain mid-sentence line-breaks. Mid-sentence line breaks usually occur when a text is copied-and-pasted, e.g. off the Internet. To avoid this (if you are using texts from the web), save the html page in text format, rather than copying-and-pasting.
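
If a text already contains mid-sentence line breaks, they can be repaired with a small script. The sketch below is one possible approach with a hypothetical function name: it joins a line to the previous one unless the previous line ends in sentence-final punctuation or is blank. It is a heuristic, so headings without final punctuation will be merged into the following line, and a line ending in an abbreviation such as "Dr." will be treated as the end of a sentence.

```python
SENTENCE_END = (".", "!", "?", '."', '!"', '?"')

def join_mid_sentence_breaks(text):
    """Join lines broken mid-sentence; keep blank lines as paragraph breaks."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        # Join only when both lines have content and the previous line
        # does not end a sentence.
        if out and out[-1] and line and not out[-1].endswith(SENTENCE_END):
            out[-1] = out[-1] + " " + line
        else:
            out.append(line)
    return "\n".join(out)
```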

8. Preparing texts for processing
A final cautionary note: always remember to edit anything you do not want included in your texts out of the texts before processing. Otherwise, TextLadder will pick it up and include it as part of the analysis.

9. Recommended number of texts (for pedagogical purposes)
Although this is a somewhat arbitrary number, I would recommend 200 as the minimum number of texts you should aim to have pass the cut-off, if you are using TextLadder to compile a sequence list/pre-reading list/list of included low-frequency words for pedagogical use. The outputted sequence list will include far fewer than 200 texts, but having 200 or more texts for TextLadder to choose from will improve the quality of your sequence list. In practice, to get 200 texts to pass the cut-off, you should be looking to collect between 200 and 300 texts for TextLadder to analyze.








Sina Ghadirian 2001