Measures of Lexical Diversity (LD)

On this page, you can calculate a number of measures of lexical diversity. For each measurement, a reference is given. If you want to read about the why and how of these calculations, you should look up the references.

Input


Options



(min. 100, max. 10000)

* The sample texts are the first chapter of George Orwell's Animal Farm, Sir Arthur Conan Doyle's A Scandal in Bohemia and the first chapter of Charles Darwin's On the Origin of Species. See References for Gutenberg links.

Please note the pre-processing that is applied to the text before any calculation:

  • All punctuation (e.g. !?.,-) is removed, as are other non-standard characters (e.g. \/*, non-UTF-8 quotes).
  • Text between square and angle brackets (e.g. [some text], <p>) is removed.
  • All numerical characters (0-9) are removed.
  • All tabs, newlines (line breaks), double spaces et cetera are removed. The remaining single spaces serve as word boundaries and are not counted.
  • All letters are converted to lower case (so 'Speak', 'SPEAK' and 'speak' are treated as one type and three tokens).
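
The steps above can be sketched in a few lines of Python. This is only an illustration, not the script behind this page: the function name `preprocess` is hypothetical, and the sketch assumes that punctuation is deleted outright rather than replaced by a space (so "well-known" becomes "wellknown") and that bracketed text is dropped before anything else.

```python
import re


def preprocess(text: str) -> list[str]:
    """Hypothetical sketch of the pre-processing described in the list above."""
    # Drop text between square or angle brackets, e.g. [some text] or <p>.
    text = re.sub(r"\[[^\]]*\]|<[^>]*>", " ", text)
    # Lower-case so that 'Speak', 'SPEAK' and 'speak' count as one type.
    text = text.lower()
    # Delete punctuation and other non-word characters, plus digits and underscores.
    # Assumption: characters are deleted, not replaced by a space.
    text = re.sub(r"[^\w\s]|[\d_]", "", text)
    # split() without arguments collapses tabs, newlines and repeated spaces.
    return text.split()


tokens = preprocess("Speak, SPEAK and speak! [note] <p>In 1984...\tok")
print(tokens)
```

Under these assumptions the example yields three tokens of the single type 'speak', which matches the lower-casing rule in the last bullet.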

Output

Tokenization ... text split into words
Frequency list ... top 40 of most frequent words
Tokens ... total number of words
Types ... number of unique words
Type-token-ratio (TTR) ... number of types divided by number of tokens
Mean word frequency (MWF) ... number of tokens divided by number of types
Mean segmental TTR (MSTTR) ... mean TTR for text segmented into chunks of 100 words; only for texts longer than 100 words
Guiraud's Index ... Guiraud (1954)
Herdan's C ... Herdan (1960, 1964)
Yule's I ... see Gries 2004: XXX
Yule's K ... see Yule 1944; Oakes 2004: 203-5
Maas's a2 ... see Maas 1972; Tweedie & Baayen 1998; Treffers-Daller 2013
Dugast's U2 ... see Dugast 1978, 1979
Measure of textual lexical diversity (MTLD) ... see McCarthy & Jarvis 2010
Processing time ... Yes, a script like this takes only milliseconds.
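
To make some of the measures above concrete, here is a minimal Python sketch. It is not the code used by this page. TTR, MWF and MSTTR follow the descriptions in the table; for Guiraud's Index, Herdan's C and Yule's K the sketch assumes the commonly cited formulas (types divided by the square root of tokens, log types divided by log tokens, and 10^4 times (sum of squared word frequencies minus tokens) divided by tokens squared). The function name `lexical_diversity`, the parameter `segment_size` and the choice to discard an incomplete final segment in MSTTR are all assumptions.

```python
import math
from collections import Counter


def lexical_diversity(tokens, segment_size=100):
    """Hypothetical sketch: a few lexical diversity measures for a token list."""
    n = len(tokens)                      # tokens: total number of words
    if n == 0:
        raise ValueError("the token list is empty")
    freqs = Counter(tokens)
    v = len(freqs)                       # types: number of unique words

    # MSTTR: mean TTR over consecutive full segments.
    # Assumption: a trailing segment shorter than segment_size is discarded.
    msttr = None
    if n > segment_size:                 # only for texts longer than one segment
        starts = range(0, n - segment_size + 1, segment_size)
        ttrs = [len(set(tokens[i:i + segment_size])) / segment_size for i in starts]
        msttr = sum(ttrs) / len(ttrs)

    # Sum of squared word frequencies, used by Yule's K.
    sum_sq = sum(f * f for f in freqs.values())

    return {
        "tokens": n,
        "types": v,
        "ttr": v / n,                    # types divided by tokens
        "mwf": n / v,                    # tokens divided by types
        "msttr": msttr,
        "guiraud": v / math.sqrt(n),
        # log(1) is 0, so Herdan's C is undefined for a one-word text.
        "herdan_c": math.log(v) / math.log(n) if n > 1 else None,
        "yule_k": 1e4 * (sum_sq - n) / (n * n),
    }


sample = "the cat and the dog and the bird".split()
print(lexical_diversity(sample, segment_size=4))
```

For the eight-word sample there are five types, so TTR is 5/8 and MWF is 8/5. The small `segment_size` is used only to keep the example short; the page itself uses segments of 100 words.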

Frequency list

Below you can see a list of the most frequent words. Most of the time, the top of the list is occupied by function words (determiners, general verbs et cetera). The list is limited to a maximum number of words and is sorted from most to least frequent.
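
A frequency list of this kind can be built with Python's standard library, as in the sketch below. The function name `frequency_list` is hypothetical, and the default `limit` of 40 is an assumption taken from the "top 40" mentioned under Output; how this page orders words with equal frequency is not stated.

```python
from collections import Counter


def frequency_list(tokens, limit=40):
    """Hypothetical sketch: (word, count) pairs, most frequent first."""
    # most_common() sorts by descending count and keeps at most `limit` entries.
    return Counter(tokens).most_common(limit)


sample = "the cat and the dog and the bird".split()
for word, count in frequency_list(sample, limit=3):
    print(word, count)
```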

Tokenized text

Below you can see the text in tokenized form. This means that each numbered item should represent a word. The tokenized text is the main ingredient for all analyses of lexical diversity, but tokenization is not always perfect. Therefore, I consider it a good habit to inspect the tokenized text.