Measures of Lexical Diversity (LD)
On this page, you can calculate several measures of lexical diversity. A reference is given for each measure. If you want to read about the why and how of these calculations, look up the references.
Input
* The sample texts are the first chapter of George Orwell's Animal Farm, Sir Arthur Conan Doyle's A Scandal in Bohemia and the first chapter of Charles Darwin's On the Origin of Species. See References for Gutenberg links.
Please take note of the pre-processing (i.e. before calculation) done here:
- All punctuation (e.g. !?.,-) is removed, as are other non-standard characters (e.g. \/*, non-UTF-8 quotes).
- Text between square and angle brackets (e.g. [some text], <p>) is removed.
- All numerical characters (0-9) are removed.
- All tabs, newlines (breaks), double spaces et cetera are removed. Remaining spaces are used as word boundaries and are not counted.
- All letters are converted to lower case (so 'Speak', 'SPEAK' and 'speak' are treated as one type and three tokens).
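The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the script behind this page: the function name `tokenize` is hypothetical, and it assumes that punctuation is deleted outright (so "well-known" becomes "wellknown") rather than replaced by a space.

```python
import re

def tokenize(text: str) -> list[str]:
    """Hypothetical helper mirroring the pre-processing steps listed above."""
    # Remove text between square or angle brackets first, e.g. [some text] or <p>.
    # This has to happen before punctuation is stripped, or the brackets are gone.
    text = re.sub(r"\[[^\]]*\]|<[^>]*>", " ", text)
    # Remove all numerical characters.
    text = re.sub(r"[0-9]", "", text)
    # Convert to lower case so 'Speak', 'SPEAK' and 'speak' count as one type.
    text = text.lower()
    # Remove punctuation and other non-letter characters (assumption: deleted,
    # not replaced by a space).
    text = re.sub(r"[^\w\s]|_", "", text)
    # Splitting on any run of whitespace handles tabs, newlines and double spaces.
    return text.split()

tokens = tokenize("Speak, SPEAK and speak! [note] <p>In 1984...")
print(tokens)
```

With this sketch, the three spellings of "speak" end up as one type and three tokens, as described in the last bullet.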
Output
Measure | Value | Description or reference |
---|---|---|
Tokenization | Please analyze a text first | text split into words |
Frequency list | Please analyze a text first | the 40 most frequent words |
Tokens | ... | total number of words |
Types | ... | number of unique words |
Type-token-ratio (TTR) | ... | number of types divided by number of tokens |
Mean word frequency (MWF) | ... | number of tokens divided by number of types |
Mean segmental TTR (MSTTR) | ... | mean TTR for text segmented into chunks of 100 words; only for text longer than 100 words |
Guiraud's Index | ... | Guiraud (1954) |
Herdan's C | ... | Herdan (1960, 1964) |
Yule's I | ... | see Gries 2004: XXX |
Yule's K | ... | see Yule 1944; Oakes 2004: 203-5 |
Maas's a2 | ... | see Maas 1972; Tweedie & Baayen 1998; Treffers-Daller 2013 |
Dugast's U2 | ... | see Dugast 1978, 1979 |
Measure of textual lexical diversity (MTLD) | ... | see McCarthy & Jarvis 2010 |
Processing time | ... | Yes, a script like this takes only milliseconds. |
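As a rough illustration of some of the measures in the table, the sketch below computes TTR, MWF, MSTTR, Guiraud's Index, Herdan's C and Yule's K from a list of tokens. It is not the code used on this page. The function names are hypothetical, the formulas are the commonly cited textbook versions (check the references for the exact definitions), and it assumes that MSTTR discards a trailing segment shorter than the segment size.

```python
import math
from collections import Counter

def ttr(tokens):
    # Type-token ratio: number of types divided by number of tokens.
    return len(set(tokens)) / len(tokens)

def mwf(tokens):
    # Mean word frequency: number of tokens divided by number of types.
    return len(tokens) / len(set(tokens))

def msttr(tokens, segment_size=100):
    # Mean TTR over consecutive full segments; a shorter trailing segment is
    # discarded (assumption). Returns None if there is no full segment.
    full = len(tokens) // segment_size
    if full == 0:
        return None
    ratios = [ttr(tokens[i * segment_size:(i + 1) * segment_size])
              for i in range(full)]
    return sum(ratios) / full

def guiraud(tokens):
    # Guiraud's Index: types divided by the square root of tokens.
    return len(set(tokens)) / math.sqrt(len(tokens))

def herdan_c(tokens):
    # Herdan's C: log(types) / log(tokens); needs at least two tokens.
    return math.log(len(set(tokens))) / math.log(len(tokens))

def yule_k(tokens):
    # Yule's K: 10^4 * (sum of squared type frequencies - N) / N^2.
    n = len(tokens)
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (sum_sq - n) / (n * n)

sample = ["a", "b", "a", "c"]
print(ttr(sample), mwf(sample), guiraud(sample), herdan_c(sample), yule_k(sample))
```

Maas's a2, Dugast's U2 and MTLD are left out here: the first two depend on the logarithm base, and MTLD depends on several implementation choices that this page does not spell out.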
Frequency list
Below you can see a list of the most frequent words. Most of the time, the top of the list is occupied by function words (determiners, general verbs et cetera). The list contains at most 40 words and is sorted from most to least frequent.
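A frequency list of this kind can be built in a few lines. The sketch below is only an illustration: the function name `frequency_list` and the `limit` parameter are hypothetical, with the default of 40 taken from the output table.

```python
from collections import Counter

def frequency_list(tokens, limit=40):
    # Count each type and return (word, count) pairs, most frequent first.
    # Words with equal counts keep their order of first appearance.
    return Counter(tokens).most_common(limit)

words = ["the", "cat", "the", "dog", "the", "cat"]
for word, count in frequency_list(words, limit=2):
    print(word, count)
```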
Tokenized text
Below you can see the text in tokenized form. This means that each numbered item should represent a word. The tokenized text is the main ingredient for all analyses of lexical diversity, but tokenization is not always perfect. Therefore, I consider it a good habit to inspect the tokenized text.
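If you tokenize a text yourself, a numbered listing makes the same kind of inspection easy. The helper below is a hypothetical sketch, not the code that produces the list on this page.

```python
def numbered_tokens(tokens):
    # Pair every token with its 1-based position so odd items are easy to spot.
    return [f"{i}. {tok}" for i, tok in enumerate(tokens, start=1)]

for line in numbered_tokens(["all", "animals", "are", "equal"]):
    print(line)
```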