Measures of Lexical Diversity (LD)
On this page, you can calculate a number of measures of lexical diversity. For each measurement, a reference is given. If you want to read about the why and how of these calculations, you should look up the references.
* The sample texts are the first chapter of George Orwell's Animal Farm, Sir Arthur Conan Doyle's A Scandal in Bohemia and the first chapter of Charles Darwin's On the Origin of Species. See References for Gutenberg links.
Please take note of the pre-processing (i.e. before calculation) done here:
- All punctuation (e.g. !?.,-) is removed, as are other non-standard characters (e.g. \/*, non-UTF-8 quotes).
- Text between square and angle brackets (e.g. [some text], <p>) is removed.
- All numerical characters (0-9) are removed.
- All tabs, newlines (breaks), double spaces et cetera are removed. The remaining single spaces are used as word boundaries and are not counted.
- All letters are converted to lower case (so 'Speak', 'SPEAK' and 'speak' are treated as one type and three tokens).
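The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the script behind this page: the function name `preprocess` is hypothetical, and it assumes that punctuation is deleted outright rather than replaced by a space, so 'well-known' becomes 'wellknown'.

```python
import re


def preprocess(text: str) -> list[str]:
    """Return a list of lower-cased word tokens (illustrative sketch only)."""
    # Remove text between square and angle brackets, e.g. [some text] or <p>.
    text = re.sub(r"\[[^\]]*\]|<[^>]*>", " ", text)
    # Lower-case so that 'Speak', 'SPEAK' and 'speak' become one type.
    text = text.lower()
    # Assumption: every character that is not a letter or whitespace
    # (punctuation, digits, underscores, other symbols) is simply deleted.
    text = re.sub(r"[^\w\s]|[\d_]", "", text)
    # split() without arguments treats any run of tabs, newlines and
    # spaces as a single word boundary.
    return text.split()
```

For example, `preprocess("Speak, SPEAK and speak!")` yields three tokens of the type 'speak' plus one token 'and'.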
|Tokenization||Please analyze a text first||text split into words|
|Frequency list||Please analyze a text first||top 40 of most frequent words|
|Tokens||...||total number of words|
|Types||...||number of unique words|
|Type-token-ratio (TTR)||...||number of types divided by number of tokens|
|Mean word frequency (MWF)||...||number of tokens divided by number of types|
|Mean segmental TTR (MSTTR)||...||mean TTR for text segmented into chunks of 100 words; only for texts longer than 100 words|
|Guiraud's Index||...||Guiraud (1954)|
|Herdan's C||...||Herdan (1960, 1964)|
|Yule's I||...||see Gries 2004: XXX|
|Yule's K||...||see Yule 1944; Oakes 2004: 203-5|
|Maas's a2||...||see Maas 1972; Tweedie & Baayen 1998; Treffers-Daller 2013|
|Dugast's U2||...||see Dugast 1978, 1979|
|Measure of textual lexical diversity (MTLD)||...||see McCarthy & Jarvis 2010|
|Processing time||...||Yes, a script like this takes only milliseconds.|
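To make some of the measures in the table concrete, here is a small sketch of how they could be computed from a list of tokens. It is not the code used on this page. TTR, MWF and MSTTR follow the descriptions in the table. For Guiraud's Index, Herdan's C and Yule's K, I assume the formulas as they are commonly given in the literature (types divided by the square root of tokens; log types divided by log tokens; 10,000 × (M2 − N) / N², where M2 is the sum of squared word frequencies). All function names are hypothetical, and dropping a trailing partial segment in MSTTR is an assumption. The other measures are left to the references.

```python
import math
from collections import Counter


def ttr(tokens):
    """Type-token ratio: number of types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)


def mwf(tokens):
    """Mean word frequency: number of tokens divided by number of types."""
    return len(tokens) / len(set(tokens))


def msttr(tokens, segment=100):
    """Mean TTR over consecutive segments; None if the text is too short."""
    if len(tokens) <= segment:
        return None
    # Assumption: a trailing segment shorter than `segment` is discarded.
    full = len(tokens) // segment
    ratios = [ttr(tokens[i * segment:(i + 1) * segment]) for i in range(full)]
    return sum(ratios) / len(ratios)


def guiraud(tokens):
    """Guiraud's Index, assumed here as types / sqrt(tokens)."""
    return len(set(tokens)) / math.sqrt(len(tokens))


def herdan_c(tokens):
    """Herdan's C, assumed here as log(types) / log(tokens); needs > 1 token."""
    return math.log(len(set(tokens))) / math.log(len(tokens))


def yule_k(tokens):
    """Yule's K, assumed here as 10^4 * (M2 - N) / N^2."""
    n = len(tokens)
    # spectrum[i] = number of types that occur exactly i times
    spectrum = Counter(Counter(tokens).values())
    m2 = sum(i * i * count for i, count in spectrum.items())
    return 1e4 * (m2 - n) / (n * n)
```

For the eight tokens of 'the cat sat on the mat the end' there are six types, so TTR is 6/8 = 0.75 and MWF is 8/6 ≈ 1.33.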
Below you can see a list of the most frequent words. Most of the time, the top of the list is occupied by function words (determiners, general verbs et cetera). The list has a maximum of 40 words and is sorted from most to least frequent.
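A frequency list like this one can be built in a few lines. The sketch below is only an illustration with a hypothetical function name, `top_words`; the limit of 40 is taken from the table above.

```python
from collections import Counter


def top_words(tokens, limit=40):
    """Return up to `limit` (word, count) pairs, most frequent first."""
    # most_common() sorts by descending count; words with equal counts
    # keep the order in which they first appear in the text.
    return Counter(tokens).most_common(limit)
```

Called on the tokens of a longer English text, the first entries will typically be function words such as 'the' and 'of'.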
Below you can see the text in tokenized form. This means that each numbered item should represent a word. The tokenized text is the main ingredient for all analyses of lexical diversity, but tokenization is not always perfect. Therefore, I consider it a good habit to inspect the tokenized text.