WebLadder Project
Online WebLadder Project
(See FAQ for details.)
TextLadder
TextLadder 3.0
TextLadder 3.0 is a new version of the traditional TextLadder program.
Although details about the earlier TextLadder can be learned below,
here's a brief description of it. TextLadder 1.0/2.0 had the ability to
go through a collection of texts in a given domain (e.g. economics,
physics, medicine) and arrange these texts into a special order, such
that students reading the texts in this order would learn the new
vocabulary incrementally, at a pace they could handle. However, TextLadder
was only able to do this for domains for which there existed domain-
specific word lists. This meant that it was up to the TextLadder user
to go out and find or make these domain-specific lists, a process both
time-consuming and difficult.
TextLadder 3.0 completely avoids this problem: in the new
TextLadder no lists are necessary. This is because the program is
now capable of compiling its own word lists. All that the user is
responsible for is providing the collection of texts, all of which must
be from the same domain.
There are a number of other relatively new options that appeared only
in the later versions of TextLadder 2.0. One is the ability to organize
the texts into subgroups, so that texts within subgroups are not separated
from each other when the reading order of the texts is constructed. Another
option is to have the user decide in advance exactly what the reading order
should be, a choice which eliminates TextLadder's ability to control the
incremental introduction of new vocabulary, but which still gives the reader
some of the "fringe benefits" of TextLadder -- such as the list of new
vocabulary presented prior to reading each text. One last new feature which
users will find interesting is the ability to link all the words in each
text and each pre-reading list to an online dictionary, so that a reader
can click on any word they don't know and get an immediate dictionary
definition or translation.
Below you'll find the help documentation for the previous version of
TextLadder, version 2.0. Relevant to the current version, 3.0, are
"Part-of-speech tagging support", under section 2, and "Length and
number of texts processed by TextLadder" through "Recommended number
of texts", under section 5. If you have any questions at all about
TextLadder 3.0, please don't hesitate to either write me or leave me a
message on the message board.
TextLadder 2.0
1. General Remarks
2. Changes from version 1.0
Part-of-speech tagging support
SOWFI support
Different algorithm
More user-changeable defaults
3. Old features from 1.0
Rule-based parsing engine
Proper noun detection
Production of output
4. Word Lists
Description
The SOWFI classification system
Download
5. Important Notes
Processing time
Windows XP and Windows 2000 vs. Windows 95, 98 and Me
Windows XP display bug
Windows XP part-of-speech tagging
The content of ScreeningOutput.txt
Length and number of texts processed by TextLadder
Mid-sentence line breaks
Preparing texts for processing
Recommended number of texts (for pedagogical purposes)
General remarks
This documentation assumes you have read the article online at
LLT. If you have not, you should at the very least read this extract
from the article and the definitions that follow it before continuing to
read on:
"This article addresses the problem of how to bring foreign language
students with a limited vocabulary consisting mainly of high-frequency
words, to the point where they are able to adequately comprehend
authentic texts in a target domain or genre. It proposes bridging the
vocabulary gap by determining which word families account for 95%
of the target domain's running words, and then having students learn
these word families by reading texts in an order that allows for the
incremental introduction of target vocabulary. This is made possible by
a recently developed computer program that sorts through a collection
of texts and 1) finds texts with a suitably high proportion of target
words, 2) ensures that over the course of these texts, most or all target
words are encountered 5 or more times, and 3) creates an order for
reading these texts, such that each new text contains a reasonably
small number of new target words and a maximum number of familiar
words."
In the following documentation,
Sequence list refers to the list of texts arranged into the order
mentioned above
Pre-reading list refers to the list of new words that a student needs
to familiarize him/herself with via a dictionary (preferably CD-ROM or
Internet-based) prior to reading each text
Included low-frequency words refers to words that occur fewer than
5 times throughout the texts but are nonetheless included on the
Pre-Reading list.
Changes from version 1.0
TextLadder version 1.0 was created in 2000 and is the version
described in the article in Language Learning & Technology. Anyone
who has read the article should be aware that TextLadder 2.0
differs from TextLadder 1.0 in the following ways:
Part-of-speech tagging support
TextLadder 2.0 supports part-of-speech tagging. This means that
if the part-of-speech option in TextLadder is selected, words in the
outputted pre-reading list are accompanied by part-of-speech
information ("n.", "adj.", etc.). (See the Sample Output section for an
example.) At this time, TextLadder only supports the Brown tagset,
which is the same tagset used by Brill. Users have two options: they
can install Brill on their computer and have TextLadder run Brill for them
(TextLadder also formats the texts for Brill prior to tagging and
afterwards renames the tagged files and puts them in their own folder),
or they can manually run the tagger and rename the files themselves. (If
the second option is chosen, users should contact me about exactly
how to go about doing this.)
SOWFI support
TextLadder 2.0 supports SOWFI classification. SOWFI stands for
"semantically opaque word family item". It refers to individual items
within word families whose meanings are not guessable on the basis of
the word family's base meaning, even though some kind of semantic
relationship connects them to it. Examples of SOWFI's are "shortly"
within the word family "short", "governor" within the word family
"govern", "earthy" within the word family "earth", and "namely" within the
word family "name". It is important that TextLadder distinguish between
SOWFI's and other members of the word family, because one cannot
assume that a learner who has encountered another member of the
word family will subsequently be able to understand a SOWFI when
confronted with it. Therefore, although a SOWFI adds to the overall
frequency count of its word family during the TextLadder sorting
process, it is given its own appearance in the pre-reading section. (The
SOWFI classification system itself is described in more detail below.)
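As a rough illustration of the counting rule just described (a sketch
under stated assumptions, not TextLadder's actual code), the Python
below credits every occurrence of a SOWFI to its word family's frequency
while also flagging the SOWFI for its own pre-reading entry. The data
layout (`family_of`, `sowfis`) and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical data: maps each word form to its word family headword.
family_of = {"short": "short", "shorter": "short", "shortly": "short"}
# Hypothetical set of forms marked as SOWFIs in the word list.
sowfis = {"shortly"}

def count_families(tokens):
    """Return (family frequencies, SOWFIs needing their own pre-reading entry)."""
    freq = Counter()
    own_entries = []
    for tok in tokens:
        fam = family_of.get(tok)
        if fam is None:
            continue  # non-list word: ignored in this sketch
        freq[fam] += 1  # a SOWFI still counts toward its family's total
        if tok in sowfis and tok not in own_entries:
            own_entries.append(tok)  # but it also gets a separate entry
    return freq, own_entries

freq, own = count_families(["short", "shortly", "shorter", "shortly"])
```

Here all four tokens are counted under the family "short", and "shortly"
is additionally singled out once for the pre-reading section.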
Different algorithm
TextLadder 2.0 uses a different algorithm from 1.0 when arranging
texts into a sequence list. This algorithm involves first trimming the
word lists to fit the particular corpus of texts being processed (note
that a separate trimmed list is created: the old lists remain intact).
The trimmed list still provides coverage that is above the reading
comprehension threshold (set at 95% by default), but contains fewer
low-frequency words. This means that sequence lists are shorter AND
that there are fewer low-frequency words included in the pre-reading list
than in version 1.0. Note that the actual number of low-frequency words
in the pre-reading lists can be decreased even further by changing one
of TextLadder's default values, but there will be a corresponding
increase in the number of texts on the sequence list. In other words,
there is a trade-off between the number of low-frequency words in the
pre-readings and the length of the sequence list.
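The exact trimming criterion is not spelled out in this documentation.
One plausible reading, sketched below purely as an assumption, is a
greedy rule: keep the most frequent word families in the corpus until
they reach the coverage threshold, and drop the rest. The function name
and the sample figures are hypothetical.

```python
def trim_list(family_freq, total_tokens, threshold=0.95):
    """Keep the most frequent families until they cover `threshold` of tokens.

    family_freq: dict mapping family headword -> occurrences in the corpus.
    Returns (kept headwords, coverage reached). The original dict is not
    modified, mirroring the note that the old lists remain intact.
    """
    kept, covered = [], 0
    # Highest frequency first; ties broken alphabetically for determinism.
    for fam, n in sorted(family_freq.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered / total_tokens >= threshold:
            break  # threshold reached: remaining low-frequency families dropped
        kept.append(fam)
        covered += n
    return kept, covered / total_tokens

freq = {"market": 50, "price": 30, "tax": 15, "levy": 3, "tariff": 2}
kept, coverage = trim_list(freq, total_tokens=100)
```

With these made-up counts, the three most frequent families already
reach the threshold, so the two rarest are left off the trimmed list.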
User-changeable default values
There has been a change in the number and kind of default values
changeable by the user in TextLadder 2.0.
a) Minimum number of encounters
In version 1.0, you could change the minimum number of
encounters required from 5 to 6; in version 2.0 you can change it to
any value you want.
b) Maximum number of new words per text
In version 1.0, the maximum number of new words per text was set
permanently at 25. In version 2.0, this value can be changed by the user
to any value desired. As with version 1.0, however, there will be certain
cases (especially among the first few texts) where TextLadder will be
unable to respect the maximum limit.
c) Theoretical reading comprehension coverage threshold
In version 1.0, this threshold was set permanently at 95%. In version
2.0, the threshold can be changed to any value between 1% and
100%. However, users should not change
this value simply because their lists account for more than 95% of the
texts being processed: TextLadder will detect that automatically and
take it into account. Users should only change this value if they believe
the 95% value itself is theoretically unsound.
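For readers who want to see what a token coverage threshold amounts to,
here is a minimal sketch. It assumes a plain set of known word forms and
ignores what TextLadder additionally does (matching word family members
and treating proper nouns as familiar); the function names are
hypothetical.

```python
def token_coverage(tokens, known_words):
    """Fraction of running words (tokens) found among the known words."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in known_words)
    return hits / len(tokens)

def passes_screening(tokens, known_words, threshold=0.95):
    """True if the text's coverage reaches the comprehension threshold."""
    return token_coverage(tokens, known_words) >= threshold

known = {"the", "price", "of", "oil", "rose"}
text = "The price of oil rose sharply".split()
coverage = token_coverage(text, known)  # 5 of the 6 tokens are known
```

In this toy example the text falls short of the default 95% threshold
because one token in six is unknown.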
d) Lowest number of repetitions of low frequency words
This value relates to the algorithm used by TextLadder 2.0 to
create the sequence list. Once all high frequency words in the corpus
have been encountered among texts already on the sequence list,
TextLadder switches gears and begins looking for texts with the highest
number of repetitions of previously-encountered low-frequency words.
The "lowest number of repetitions of low frequency words" value tells
TextLadder when to stop looking for these repetitions. E.g. if the value is
set at 20 (the default value), TextLadder will stop the process of adding
texts to the sequence list once the number of repetitions of low-
frequency words per text drops below 20.
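The stopping rule can be pictured with the following sketch. It is an
illustration under assumptions, not the program's real implementation:
each candidate text is reduced to a list of its low-frequency tokens,
and the helper name `extend_sequence` is hypothetical.

```python
def extend_sequence(candidates, seen_low_freq, min_repetitions=20):
    """Greedily add texts that repeat already-encountered low-frequency words.

    candidates: dict mapping text name -> list of low-frequency tokens in it.
    seen_low_freq: low-frequency words met in texts already on the list.
    """
    sequence = []
    remaining = dict(candidates)
    seen = set(seen_low_freq)
    while remaining:
        # Score = number of tokens repeating a previously encountered word.
        scores = {name: sum(1 for w in words if w in seen)
                  for name, words in remaining.items()}
        best = max(sorted(scores), key=scores.get)
        if scores[best] < min_repetitions:
            break  # repetitions per text dropped below the cut-off: stop
        sequence.append(best)
        seen.update(remaining.pop(best))
    return sequence

texts = {"a": ["levy", "levy", "levy"], "b": ["levy", "tariff"], "c": ["quota"]}
strict = extend_sequence(texts, {"levy"}, min_repetitions=2)
loose = extend_sequence(texts, {"levy"}, min_repetitions=1)
```

Lowering the cut-off lets more texts onto the sequence list, which is
the trade-off the default value of 20 is meant to balance.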
e) Build a sequence list vs. Create list of high-frequency non-list words
By default this value is set to "Build a sequence list" in TextLadder
2.0. However, the user has the option to select the other option,
which tells TextLadder not to build a sequence list but instead look for
the highest frequency NON-list words. Such a list could be useful to a
user wanting to put together a "rough and ready" domain specific list.
For example, if a user wanted to put together a very rough geology-
related domain-specific list, s/he could collect together a group of
academic texts related to geology, then open TextLadder, select this
option, choose Nation's 2,000-most-frequent-words-list and Coxhead's
Academic Word List as the coverage lists, and run TextLadder on the
texts. The user could then see what high-frequency non-list words
emerge, and use them as the basis for a very rough domain-specific
word list. Note that TextLadder provides not just frequency but also
range information about these non-list words, so the user could avoid
including in the domain-specific list a word that occurred 15 times
but in only one text.
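To make the distinction between frequency and range concrete, here is a
small sketch. It assumes texts are already tokenized and that list
membership is a simple set lookup; `non_list_words` is a hypothetical
name, not part of TextLadder.

```python
from collections import Counter

def non_list_words(texts, list_words):
    """Return {word: (frequency, range)} for words not on the coverage lists.

    texts: dict mapping text name -> list of tokens.
    frequency = total occurrences; range = number of texts containing the word.
    """
    freq, rng = Counter(), Counter()
    for tokens in texts.values():
        outside = [t for t in tokens if t not in list_words]
        freq.update(outside)
        rng.update(set(outside))  # each text counts at most once per word
    return {w: (freq[w], rng[w]) for w in freq}

texts = {"t1": ["magma", "magma", "the", "basalt"], "t2": ["basalt", "the"]}
stats = non_list_words(texts, list_words={"the"})
```

Both sample words occur twice, but only "basalt" appears in two texts,
which is the kind of difference the range figure is meant to expose.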
f) Include non-list words and very low frequency words in pre-reading
list
By default this value is set to "No", meaning that only list words will
appear in the pre-reading list (not necessarily all the list words in each
text, but rather only as many as is necessary to reach the reading
comprehension threshold, set at 95% token coverage by default). Set
this value to "Yes" if you want EVERY unfamiliar word in each text listed
in the pre-reading list.
g) Processing Options: "Standard Processing" vs. "SubGroups Restriction
Processing" vs. "Pre-Set Order Processing"
* Standard processing: TextLadder arranges texts into the best
possible order for vocabulary introduction, keeping the total number
of texts to be read as short as possible.
* Subgroups restriction: TextLadder accepts a restriction
on its processing: it does not separate texts within a subgroup.
Within the subgroups option, there are various sub-options
the user can choose among:
- The user has the option either to allow TextLadder to arrange the
order of the subgroups themselves as it sees fit, or to arrange
the subgroups according to a pre-set order.
- The user has the option of specifying how many texts in the
subgroups should appear in the final output. Thus, a user could,
for example, input subgroups with 10 or more texts each, but
specify that only 4 texts from each subgroup should appear on the
final sequence list. This would mean that TextLadder would choose
the best 4 texts from each subgroup.
** As a variant on this last option, the user would
be able to specify the number of texts from a subgroup that
should appear on the final sequence list for each individual
subgroup. Thus, while the number of texts specified for one
subgroup might be 4, the number specified for another might be 7,
and so on.
- The user has the option to either have TextLadder eliminate all
texts among the subgroups that do not pass the reading comprehension
threshold (which is one of the defaults that may be changed, though
originally set at 95% -- see above), or not to eliminate these texts.
During standard processing, texts that do not pass the threshold
are automatically eliminated, but a user who wants subgroup texts
to be included even when they don't pass the threshold now has
that option.
* Pre-set order option: TextLadder will do no arranging whatsoever
but will leave the texts in the exact order entered by the user.
Within the pre-set order option:
- As with the subgroups, the user has the option either to have
TextLadder eliminate texts which do not pass the reading
comprehension threshold or not.
h) Option to "Create HTML versions of text files and pre-reading list"
The user has the option to have TextLadder create HTML versions of
the text files being processed as well as the outputted pre-reading
list. The words in these HTML versions will be linkable to an online
dictionary website. Students reading the texts or pre-reading lists
will therefore be able to click on words they don't understand and
get an instant dictionary definition or translation.
NOTE: TextLadder does not supply hyperlink references for any website
dictionaries. It is up to the user to research such websites and
supply the reference in the box TextLadder provides. Also, it is the
user's responsibility (where necessary) to contact the website and
ask permission to link to them.
The outputted HTML pre-reading list will appear in the TextLadder
program subfolder with the rest of the output files. The HTML
versions of the text files will appear in a subfolder called
"TextLadderHTML", which will be located in the same folder (or
folders) that contain the user's text files.
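A minimal sketch of what linking words to a dictionary could look like
follows. It is not TextLadder's own output format: the URL template
(`dictionary.example`) is a placeholder the user would replace, and the
function name is hypothetical. Only runs of ASCII letters are linked, so
a word like "don't" would be split at the apostrophe.

```python
import html
import re
from urllib.parse import quote

def link_words(text, url_template):
    """Wrap each run of ASCII letters in a link built from `url_template`.

    url_template must contain the placeholder {word}.
    """
    parts = []
    # The capture group makes re.split keep the words as separate pieces.
    for piece in re.split(r"([A-Za-z]+)", text):
        if piece.isascii() and piece.isalpha():
            href = url_template.format(word=quote(piece.lower()))
            parts.append('<a href="{}">{}</a>'.format(html.escape(href), piece))
        else:
            parts.append(html.escape(piece))  # punctuation, spaces, digits
    return "".join(parts)

linked = link_words("Tax & spend", "https://dictionary.example/define?q={word}")
```

Everything between the words is HTML-escaped, so characters such as "&"
in the source text do not break the generated page.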
i) Proper nouns
Note that the option not to count proper nouns as familiar words
when calculating token coverage was included in version 1.0 but is not
included in version 2.0. This means that proper nouns are always
considered "familiar items" by TextLadder. I may add this option back
again, depending on the demand for it.
Old Features from Version 1.0
Rule-based parsing engine
TextLadder still uses the same rule-based parsing engine to match
words not on the selected word lists to words from the same word
families that ARE on the word lists. For example, if "tax" is included in
the word list, TextLadder's parsing engine will be able to match "taxes",
"taxation", and "untaxed" to it, even though these three words are not on
the word list. Of course, there are many exceptions in the English
prefixing and suffixing system, and TextLadder has been trained on the
Brown corpus with Nation's 2,000 high frequency word list and
Coxhead's AWL in order to learn to detect these exceptions. However,
its accuracy rate is not 100%, particularly for word families
outside these 3 lists. Still, from what I have seen so far its accuracy
is pretty good, perhaps because many of the most "problematic" items
(in terms of orthographic irregularities) are from (or bear a resemblance
to) high-frequency Anglo-Saxon word families of the kind found in
Nation's list.
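The general idea of matching an unlisted form to a listed headword can
be sketched with naive affix stripping. This is only an illustration:
the affix tables below are tiny and hypothetical, and unlike
TextLadder's trained engine the sketch knows nothing about exceptions.

```python
# Hypothetical, deliberately tiny affix tables for illustration only.
PREFIXES = ("un", "re", "non")
SUFFIXES = ("ation", "es", "ed", "ing", "s")

def find_headword(word, headwords):
    """Return the listed headword that `word` reduces to, or None."""
    word = word.lower()
    candidates = [word]
    # Try removing one prefix.
    for p in PREFIXES:
        if word.startswith(p):
            candidates.append(word[len(p):])
    # Then try removing one suffix from each candidate so far.
    for cand in list(candidates):
        for s in SUFFIXES:
            if cand.endswith(s):
                candidates.append(cand[:-len(s)])
    for cand in candidates:
        if cand in headwords:
            return cand
    return None

headwords = {"tax"}
matches = [find_headword(w, headwords) for w in ("taxes", "taxation", "untaxed")]
```

All three example forms reduce to "tax". A purely mechanical rule like
this would also produce false matches on irregular forms, which is why
orthographically irregular items are listed explicitly in the word
lists.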
Proper noun detection
Built into the above engine is the ability to distinguish between words
in a text that are truly proper nouns and words which only appear to be
proper nouns (e.g. because they're at the start of the sentence). As with
the suffix and prefix-based predictions, accuracy is not 100%.
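One simple heuristic for this distinction, offered here as an assumption
rather than a description of TextLadder's engine, is to trust
capitalization only when a word appears capitalized somewhere other than
at the start of a sentence. The function name is hypothetical, and the
sentence splitting is crude (it would, for instance, also flag the
pronoun "I").

```python
import re

def proper_nouns(text):
    """Capitalized words seen at least once in a non-sentence-initial position."""
    found = set()
    # Split after sentence-final punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        for w in words[1:]:  # skip the sentence-initial word
            if w[0].isupper():
                found.add(w)
    return found

sample = "Prices rose in Canada. Canada exports oil. Exports fell."
names = proper_nouns(sample)
```

In the sample, "Prices" and "Exports" are capitalized only because they
open a sentence, so they are not treated as proper nouns.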
Production of output
The same pieces of output are generated: Screening output, Sequence
List output, Included Low-Frequency Words output, and Final Pool
Contents output. (See the article for details on these.) The one
optional addition is the list of high-frequency non-list words
(described in the "User-changeable default values" section above).
Word Lists
Description
The following lists are modified versions of Paul Nation's thousand
most frequent word families of English list and second thousand most
frequent word families of English list, and Averil Coxhead's Academic
Word List. The lists have been modified for use with TextLadder in the
following way: they list the word family headwords and, under each
headword, include only particular word family items that are
orthographically irregular or semantically opaque. The orthographically
irregular items are listed because the TextLadder parsing engine can
only successfully predict orthographically regular items, and the
semantically opaque items are listed so that TextLadder can be
SOWFI-sensitive.
Note to users: nobody should confuse these lists with the original
lists (Nation's lists are viewable online at the following address:
www1.harenet.ne.jp/~waring/Wordlists/vocfreq.html;
Coxhead's list is viewable at www.vuw.ac.nz/lals/div1/awl/)
The lists available below have been modified for use with
TextLadder or TextLadder-like parsing engines, and are meant to be
used as such. If you have questions regarding these modified lists
(e.g. questions about asterisk placement, why certain items are
considered orthographically irregular, etc.), please DO NOT address
them to the authors of the original lists.
The SOWFI classification system
The below lists also act as a model to users attempting to put together
their own SOWFI-ready lists for use with TextLadder. For those of you
who fall into this category, or who are simply curious about what all
the asterisks in the list mean, I'm going to take a quick moment to
explain the SOWFI classification system.
When two or more words within a word family have the same number
of asterisks, this means that they are "linked SOWFI's." For example,
the word family "determine" is listed as follows:
determine
determinate *
indeterminate *
Both "determinate" and "indeterminate" are considered linked SOWFI's;
i.e. having encountered one means one should be able to guess the
meaning of the other.
When words within a word family have different numbers of asterisks,
they are separate SOWFI's. For example, the word family "technical" is
listed as follows:
technical
technically *
technicality **
"Technically" and "technicality" are considered separate SOWFI's; i.e.
they will each appear on their own in the Pre-Reading list (under the
section reserved for SOWFI's from previously encountered word
families, called "Old Words with New Meanings").
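For anyone building their own SOWFI-ready lists, the asterisk convention
can be read mechanically, as in the sketch below. It assumes one word
family per call, with the headword on the first line and each item
followed by its asterisks after a space, as in the two examples above;
the real list files may be laid out differently, and `parse_family` is a
hypothetical helper.

```python
from collections import defaultdict

def parse_family(lines):
    """Group a family's items by asterisk count (0 = not marked as a SOWFI).

    Items sharing a count are linked SOWFIs; different counts are separate.
    """
    headword = lines[0].strip()
    groups = defaultdict(list)
    for line in lines[1:]:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        stars = parts[1].count("*") if len(parts) > 1 else 0
        groups[stars].append(parts[0])
    return headword, dict(groups)

head1, linked = parse_family(["determine", "determinate *", "indeterminate *"])
head2, separate = parse_family(["technical", "technically *", "technicality **"])
```

The first family yields a single group of two linked SOWFIs, while the
second yields two groups of one item each.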
A personal note about the process of identifying SOWFI's:
I cannot overemphasize how subjective the process of identifying
SOWFI's is. There will definitely be items on the below lists
identified as SOWFI's that you think shouldn't be, and perhaps some
items not identified as SOWFI's that you think should be. In the end
I decided to be very liberal in identifying SOWFI's, placing asterisks
next to almost anything with the potential for semantic opacity. I felt
that there was less harm in overidentifying than underidentifying, since
if the item isn't semantically opaque, the reader will simply skip over it
during the Pre-Reading section. Still, please feel free to remove
asterisks that you feel are unwarranted from your copy of these lists.
Download
You can download the Lists just below. Included with the three other
lists in the zip file is a very rough copy of an Economics domain-
specific list I've generated on the basis of information from 261
Economics-related news texts. I'm supplying it here so that TextLadder
users have at least one domain (Economics) for which the lists in
this package provide 95% coverage (in conjunction with proper nouns).
Download lists
(Feb. 22, 2002: A fifth list, consisting of high-frequency word families
found in Voice of America simplified news texts, is available here
and is useful for anyone looking to get 95% coverage of Voice of
America Special English texts.)
Important Notes
1. Processing Time
Once you choose the texts you want to process and the coverage lists
you'd like TextLadder to use, TextLadder will load some files and then
begin the screening process. Once the screening process is finished,
TextLadder will tell you how many texts passed the screening and ask
you if you want to continue. If you click "Yes", TextLadder will begin
the actual sorting which takes a fairly long time - anywhere from 15
minutes to 2 hours, depending on the number of texts and the speed of
your computer. (Note that I'm referring to the "Build sequence list"
option here, not the "Generate list of high-frequency non-list words"
option, which goes a lot faster.)
2. Windows XP and Windows 2000 vs. Windows 95, 98 and Me
In versions of TextLadder prior to 2.10 (i.e. versions 2.00 through
2.09) TextLadder ran 5 to 7 times slower on Windows 98/Me than
on Windows 2000/XP. However, with versions 2.10 and later there
is only a small difference between TextLadder's performance
on Windows 2000/XP and its performance on Windows 98/Me
(TextLadder runs approximately 1.5 times slower on 98/Me,
all other things being equal).
3. Windows XP display bug
Windows XP users may find that TextLadder's display window
freezes at some point. However, TextLadder is in fact running
perfectly: it is simply unable to refresh the display window. To
confirm that TextLadder is in fact still running, go into the TextLadder
program folder, then into the Output subfolder, and click on the file
"LastUpdated.txt". This file is updated by TextLadder at regular
intervals: if the time listed in this file is no more than 5 to 10
minutes old, then you know that TextLadder is running OK.
4. Windows XP part-of-speech tagging
Please note that the version of Brill that was compiled for Windows/
DOS using djgpp does not work in Windows XP. Instead, Windows XP
users must download the version of Brill that was compiled for
Windows using MinGW32. Both versions of Brill are available for
download here.
5. The content of ScreeningOutput.txt
When TextLadder tells you how many texts passed the screening and
asks you if you want to continue, you can check the file
ScreeningOutput.txt for more details about the results of the
screening process (e.g. the exact coverage figures for each text).
In versions of TextLadder prior to 2.15, the content of ScreeningOutput.txt
changed over the course of the processing (first giving coverage figures
vis-a-vis the original word lists, then giving coverage figures
vis-a-vis the trimmed word list). In version 2.15 and subsequent versions,
ScreeningOutput.txt (and its co-product, ScreeningOutputNotPassed.txt)
gives coverage figures only vis-a-vis the original word lists.
6. Length and number of texts processed by TextLadder
It is recommended that you use texts not much longer than 3000
words if you want to avoid having texts excluded during the screening
process for being too long. Also, note that you can have as many as
750 texts for processing.
For the "compile a list of high-frequency non-list words" option, texts
can be as long as 2,250,000 words. However, certain restrictions apply:
the combined length of all the texts cannot exceed this 2,250,000-word
limit.
Also, it is better to have either a few very large texts or many small
texts, but not both together.
7. Mid-sentence line-breaks
If you plan to use the part-of-speech tagging option, make sure that the
texts you are using do not contain mid-sentence line-breaks. Mid-
sentence line breaks usually occur when a text is copied-and-pasted,
e.g. off the Internet. To avoid this (if you are using texts from the web),
save the html page in text format, rather than copying-and-pasting.
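If texts with mid-sentence line breaks have already been collected, a
small script can repair many of them before tagging. The sketch below is
one possible approach and rests on an assumption: any line that does not
end in sentence-final punctuation is treated as continuing on the next
line. Headings without final punctuation would therefore be merged into
the following line, so the result should be checked by hand.

```python
CLOSERS = "\"')]"  # quotes/brackets that may follow the final punctuation

def ends_sentence(line):
    """True if the line ends with ., ! or ? (ignoring closing quotes/brackets)."""
    return line.rstrip(CLOSERS).endswith((".", "!", "?"))

def join_broken_lines(text):
    """Merge lines that stop mid-sentence with the line that follows them."""
    out = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            out.append("")  # keep blank lines as paragraph separators
        elif out and out[-1] and not ends_sentence(out[-1]):
            out[-1] += " " + line  # previous line was cut mid-sentence
        else:
            out.append(line)
    return "\n".join(out)

sample = "Oil prices rose\nsharply last week.\nAnalysts were surprised."
fixed = join_broken_lines(sample)
```

In the sample, the first two lines are rejoined into one sentence and
the third is left on its own line.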
8. Preparing texts for processing
A final cautionary note: always remember to edit anything you do not
want included in your texts out of the texts before processing.
Otherwise, TextLadder will pick it up and include it as part of the
analysis.
9. Recommended number of texts (for pedagogical purposes)
Although this is a somewhat arbitrary number, I would recommend 200
as the minimum number of texts you should aim to have pass the cut-
off, if you are using TextLadder to compile a sequence list/pre-reading
list/list of included low-frequency words for pedagogical use. Of course,
the outputted sequence list will include far fewer than 200
texts, but having more than 200 texts for TextLadder to choose from will
improve the quality of your sequence list. In practice, to get 200 texts to
pass the cut-off, you should be looking to collect between 200 and 300
texts for TextLadder to analyze.
© Sina Ghadirian 2001