Lexical Diversity Measurements

On this page, you can calculate a large number of measures of lexical diversity. For each measurement, a reference is given. If you want to read about the why and how of these calculations, you should look up the references. If you use this tool for your research, please cite it as follows.

Reuneker, A. (2017). Lexical Diversity Measurements. Retrieved 04 July, 2025, from https://www.reuneker.nl/files/ld.

For other calculation tools (such as ngrams, wordlists and keyword analysis), and for contact details, see https://www.reuneker.nl.

Input

* The sample texts are the first chapter of George Orwell's Animal Farm, Sir Arthur Conan Doyle's A Scandal in Bohemia and the first chapter of Charles Darwin's On the Origin of Species. See References for Gutenberg links.

Please take note of the pre-processing (i.e. before calculation) done here:

All punctuation (e.g. !?.,-) is removed, as are other non-standard characters (e.g. \/*, non-UTF-8 quotes).
All tabs, newlines (breaks), double spaces, et cetera are removed. The remaining spaces are used as word boundaries and are not counted.
All letters are converted to lower case (so 'Speak', 'SPEAK' and 'speak' are treated as one type and three tokens).
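The three pre-processing steps above can be sketched in a few lines of Python. This is only an illustration, not the code that runs on this page: the function name `tokenize` is made up, and the sketch assumes that punctuation is deleted outright rather than replaced by a space (so 'well-known' becomes one token, 'wellknown').

```python
import re


def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer that mirrors the pre-processing steps listed above."""
    # Step 3: lower-case everything, so 'Speak' and 'SPEAK' become 'speak'.
    text = text.lower()
    # Step 1: delete every character that is not a word character or whitespace.
    # Assumption: characters are removed, not replaced by a space.
    text = re.sub(r"[^\w\s]", "", text)
    # Step 2: split() treats any run of spaces, tabs or newlines as one boundary.
    return text.split()


tokens = tokenize("Speak, SPEAK\tand  speak!\n")
print(tokens)            # the token list
print(len(tokens))       # number of tokens
print(len(set(tokens)))  # number of types
```

Inspecting the printed list is the same sanity check as reading the tokenized text in the output.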

Output

Tokenized text

Below you can see the text in tokenized form. This means that each numbered item should represent a word. The tokenized text is the main ingredient for all analyses of lexical diversity, but tokenization is not always perfect. Therefore, I consider it a good habit to inspect the tokenized text.

Frequency list

Below you can see a list of the most frequent words. Most of the time, the top of the list is occupied by function words (determiners, general verbs, et cetera). The list is capped at a maximum number of words and is sorted from most to least frequent.
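A frequency list of this kind can be approximated as follows. This is a minimal sketch that assumes the text has already been tokenized; the function name `frequency_list` and its `limit` parameter are illustrative and are not taken from this tool.

```python
from collections import Counter


def frequency_list(tokens, limit=None):
    """Return (word, count) pairs sorted from most to least frequent.

    `limit` caps the length of the list; None returns every word.
    """
    return Counter(tokens).most_common(limit)


tokens = "the cat saw the dog and the bird saw".split()
for word, count in frequency_list(tokens, limit=2):
    print(word, count)
```

Words with equal counts keep the order in which they first appear in the text.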

Hapax legomena

Below you can see a list of words that occur only once (hapax legomena).

Dis legomena

Below you can see a list of words that occur twice (dis legomena).
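Hapax and dis legomena are both cases of "words that occur exactly n times", so one helper can cover them. The sketch below assumes a tokenized text; the name `words_occurring` is hypothetical and the alphabetical sorting is only a presentation choice.

```python
from collections import Counter


def words_occurring(tokens, n):
    """Return the words that occur exactly n times, sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == n)


tokens = "the cat saw the dog and the bird saw".split()
hapax_legomena = words_occurring(tokens, 1)  # words occurring once
dis_legomena = words_occurring(tokens, 2)    # words occurring twice
print(hapax_legomena)
print(dis_legomena)
```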

References

Darwin, Charles (1859). On the Origin of Species. John Murray Publishing. http://www.gutenberg.org/cache/epub/2009/pg2009.txt

Doyle, A. Conan (1892). The Adventures of Sherlock Holmes. George Newnes Publishing. http://www.gutenberg.org/cache/epub/1661/pg1661.txt

Johnson, W. (1944). Studies in Language Behavior: I. A program of research. Psychological Monographs, 56, 1-15. http://psycnet.apa.org/journals/mon/56/2/1/

McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381-392. https://doi.org/10.3758/BRM.42.2.381

Orwell, George (1945). Animal Farm. London: Secker and Warburg. http://gutenberg.net.au/ebooks01/0100011.txt

About

This page was created by Alex Reuneker. If you use the calculations presented here, please refer to Reuneker (2017) in your work. For the full APA reference and contact details, see the top of this page.

Data and privacy

The text you enter is sent to the server via HTTPS, processed on that server, and the results are sent back to your browser. Neither the text nor any other data are stored; everything is deleted after the calculations have been performed.

Change log

2025-06-12: Added mean sentence length (and sd) as well as mean word length (and sd).

2025-06-11: Lempel–Ziv–Welch compression algorithm implemented, resulting in the LZW compression rate.

2025-06-05: Compression algorithm implemented using zlib, resulting in a standard compression rate.

2025-06-05: Small paragraph on data and privacy added.

2025-05-29: Added the choice between the natural logarithm and base 10 in the calculation of Maas's a², Dugast's U², and Herdan's C.

2025-05-29: Various improvements to calculations and algorithms; added MATTR.

2025-05-26: Important change to the calculation of MTLD, which was slightly off before due to not averaging the forward and backward algorithm.

2025-03-03: Added feature to count hapax legomena.

2024-06-11: Added feature to strip all text between square brackets ([ and ]).

2022-04-01: Romuald Dalodiere from Université de Mons reported a problem with the calculation of Maas's a². After some searching, it turned out that the wrong logarithm was used. This was fixed on the same day, and the same fix was applied to the calculations of Herdan's C and Dugast's U. A function to report more decimals was added as well. Thank you for your feedback, Romuald.
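The 2025-05-26 entry above refers to averaging a forward and a backward pass in MTLD. The sketch below illustrates that idea along the lines of McCarthy & Jarvis (2010); it is not the code used on this page. The function names are made up, and both the strict comparison against the 0.72 threshold and the handling of a text without any repetition are assumptions.

```python
def mtld_one_direction(tokens, threshold=0.72):
    """One pass of MTLD: count how many 'factors' the text contains."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        # Assumption: a factor is complete once the running TTR drops
        # strictly below the threshold; the counters are then reset.
        if len(types) / count < threshold:
            factors += 1
            types.clear()
            count = 0
    if count > 0:
        # The leftover tokens contribute a partial factor.
        remainder_ttr = len(types) / count
        factors += (1 - remainder_ttr) / (1 - threshold)
    if factors == 0:
        # No repetition at all: the value is undefined, reported here as infinity.
        return float("inf")
    return len(tokens) / factors


def mtld(tokens, threshold=0.72):
    """Average of the forward pass and the pass over the reversed text."""
    if not tokens:
        raise ValueError("cannot compute MTLD for an empty text")
    forward = mtld_one_direction(tokens, threshold)
    backward = mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2


print(mtld(["a", "a", "b", "c"]))
```

In the example, the forward pass completes a factor after the second token, whereas the backward pass never does and only yields a partial factor, so the two directions give different values. This is why reporting a single direction can be slightly off.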

Tokenization	...	text split into words
Frequency list	...	all words and their frequencies
Tokens	...	total number of words
Types	...	number of unique words
Average sentence length	...	average number of words per sentence
Hapax legomena	...	number of words occurring only once
Dis legomena	...	number of words occurring twice
Type-token-ratio (TTR)	...	number of types divided by number of tokens
Mean word length (MWL)	...	mean number of characters per word
Mean word frequency (MWF)	...	number of tokens divided by number of types
Mean segmental TTR (MSTTR)	...	mean TTR for text segmented into chunks of 100 words; only for text longer than 100 words; segment length set to 25 (Johnson 1944)
Moving average TTR (MATTR)	...	mean TTR for successive windows of a text; window size set to 25 words (Covington 2007; Covington & McFall 2010)
Guiraud's Index	...	Guiraud (1954)
Herdan's C	...	Herdan (1960; 1964)
Yule's I	...	Yule (1944); Gries (2004)
Yule's K	...	Yule (1944); Oakes (2004, pp. 203-5)
Maas's a²	...	Maas (1972); Tweedie & Baayen (1998); Treffers-Daller (2013)
Dugast's U²	...	Dugast (1978; 1979)
Measure of textual lexical diversity (MTLD)	...	McCarthy & Jarvis (2010)
Compression rate (GZ)	...	The deflate function from zlib is used instead of gzcompress, because the latter adds headers, which penalize shorter texts more severely. The compression rate (0-1) is calculated by subtracting the length of the compressed text divided by the length of the original, uncompressed text from 1. The higher the rate, the higher the compression, meaning the text contains more repetition, and vice versa.
Compression rate (LZW)	...	The Lempel–Ziv–Welch algorithm (Welch, 1984) is used. The compression rate is calculated by subtracting the length of the compressed text divided by the length of the original, uncompressed text from 1. The higher the rate, the higher the compression, meaning the text contains more repetition — and vice versa.
Processing time	...	Yes, a script like this takes only milliseconds.
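
Some of the simpler measures in the table can be sketched in Python as follows. This is an illustration under stated assumptions, not the code behind this page: all function names are made up, segment and window sizes are passed explicitly, a trailing segment shorter than the segment length is discarded in MSTTR, and the raw deflate stream of Python's zlib module stands in for the deflate function mentioned above.

```python
import zlib


def ttr(tokens):
    """Type-token ratio: number of types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)


def mwf(tokens):
    """Mean word frequency: number of tokens divided by number of types."""
    return len(tokens) / len(set(tokens))


def msttr(tokens, segment):
    """Mean TTR over consecutive, non-overlapping segments of equal length."""
    # Assumption: a trailing segment shorter than `segment` is discarded.
    starts = range(0, len(tokens) - segment + 1, segment)
    scores = [ttr(tokens[i:i + segment]) for i in starts]
    if not scores:
        raise ValueError("text is shorter than one segment")
    return sum(scores) / len(scores)


def mattr(tokens, window):
    """Mean TTR over every window of `window` successive tokens."""
    if len(tokens) < window:
        raise ValueError("text is shorter than one window")
    starts = range(len(tokens) - window + 1)
    scores = [ttr(tokens[i:i + window]) for i in starts]
    return sum(scores) / len(scores)


def deflate_rate(text):
    """Compression rate: 1 minus compressed length over original length."""
    raw = text.encode("utf-8")
    # A negative wbits value requests a raw deflate stream without header
    # or checksum, so short texts are not penalized by fixed overhead.
    compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    compressed = compressor.compress(raw) + compressor.flush()
    return 1 - len(compressed) / len(raw)


tokens = ["a", "b", "a", "a"]
print(ttr(tokens), mwf(tokens))
print(msttr(tokens, segment=2), mattr(tokens, window=2))
print(deflate_rate("the cat " * 100))
```

A highly repetitive string such as the last example compresses to a small fraction of its original size, so its rate is close to 1. For very short or varied texts the rate can approach 0 or even become slightly negative.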