lexical diversity - Alex Reuneker

Sorted frequency table for Hirsch-Popescu Point added to Lexical Diversity Calculator

As I was working on a very brief piece on the Hirsch-Popescu Point (HPP) in one of the chapters of Reve's The Evenings, it occurred to me that the Lexical Diversity Calculator does calculate the Hirsch-Popescu Point, but didn't yet offer the option to look at the sorted frequency table used to determine it. As that table can be very informative, I have now added it to the Lexical Diversity Calculator. The actual HPP, that is, the word whose position in the sorted frequency list matches its frequency, is marked in bold and red for easy identification.

HPP marked in bold and red

If you'd like to use it, just head over to https://www.reuneker.nl/files/ld.

Hirsch-Popescu Point added to Lexical Diversity Calculator

11 December 2025 in Taal & Literatuur by Alex (Hirsch-Popescu Point H-P Point lexical diversity repetition metric linguistics)

The Hirsch-Popescu Point (Popescu & Altmann, 2006) is an interesting metric to assess repetition in a text. It is determined by first calculating the frequency distribution of all words in the text. Then, words are ranked from the most frequent to the least frequent. The H-P Point is then defined as 'the point in which the ranking of a word in the distribution matches its frequency, just like the h-index in academia' (see Nunes, Ordanini, & Valsesia, 2017, p. 20; Hirsch, 2005). Indeed, the h-index is a well-known measure of productivity and citation impact of publications. The smaller the H-P Point, the less repetition a text contains, and vice versa: the greater the H-P Point, the more repetition a text contains.
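
The steps above can be sketched in a few lines of Python. This is only an illustration, not the code behind the Lexical Diversity Calculator: the tokenizer, the tie-breaking rule and the function names are assumptions made for this example. When no rank matches its frequency exactly, the sketch simply returns None, whereas the literature handles that case by interpolating between neighbouring ranks.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters and apostrophes.
    # The calculator may tokenize differently.
    return re.findall(r"[a-z']+", text.lower())


def sorted_frequency_table(tokens):
    """Return (rank, word, frequency) rows, most frequent word first."""
    counts = Counter(tokens)
    # Assumption: ties in frequency are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


def hp_point(table):
    """Return (rank, word) where rank equals frequency, or None if no row matches."""
    for rank, word, freq in table:
        if rank == freq:
            return rank, word
    return None


if __name__ == "__main__":
    tokens = tokenize("a a a a b b b c c c d d e")
    table = sorted_frequency_table(tokens)
    for rank, word, freq in table:
        print(rank, word, freq)
    print("H-P Point:", hp_point(table))
```

Note that the tie-breaking rule only affects which word sits at the matching rank, not the value of the point itself: if some row has rank equal to frequency, any ordering of the words with that frequency puts one of them at that rank.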

If we apply the calculation to an example text, we easily see how it works exactly. The text used here is Sylvia Plath’s poem 'Lady Lazarus' (1965), and the resulting frequency table (distribution of words) can be seen below.

Word distribution in Sylvia Plath’s poem ‘Lady Lazarus’ (1965), https://www.poetryfoundation.org/poems/49000/lady-lazarus

In this table, we see that the word 'of' occurs eight times in the poem, and that it is also ranked at position eight when words are ordered from most to least frequent. Therefore, 8 is the H-P Point. Of course, building this frequency table, sorting it and determining the point at which frequency and rank coincide are all done for you by the Lexical Diversity Calculator, as can be seen below.

The Hirsch-Popescu Point as calculated by the Lexical Diversity Calculator

As always, if you find it useful, have fun!

MATTR added to the Lexical Diversity Calculator

04 June 2025 in Taal & Literatuur by Alex (mattr ttr types tokens lexical diversity calculator)

Last week, I implemented the calculation of MATTR (Moving Average TTR) in the Lexical Diversity Calculator. MATTR calculates the mean TTR over successive windows of a text (Covington & McFall, 2010); the idea is that this gives a more stable indication of lexical diversity. While that’s not entirely the case (see Bestgen, 2025), you can still test it at https://www.reuneker.nl/ld.
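
To make the windowing idea concrete, here is a minimal Python sketch of a moving-average TTR. It is not the calculator's implementation; the function name, the default window of 50 tokens and the fallback for texts shorter than one window are assumptions made for this example.

```python
def mattr(tokens, window=50):
    """Mean type-token ratio over all successive windows of `window` tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        # Assumption: fall back to plain TTR when the text is shorter
        # than a single window.
        return len(set(tokens)) / len(tokens)
    # Slide the window one token at a time and compute TTR for each position.
    ratios = [
        len(set(tokens[start:start + window])) / window
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(mattr(["a", "a", "b", "b"], window=2))
```

With the four tokens above and a window of two, the windows are 'a a', 'a b' and 'b b', with TTRs of 0.5, 1.0 and 0.5, so the result is their mean.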

Photo by Sean Nufer on Unsplash

Next: implementing a compression-rate measure to operationalize text repetitiveness for what hopefully becomes a project together with Vivien Waszink!

Improvements to the Lexical Diversity Calculator

29 May 2025 in Taal & Literatuur by Alex (lexical diversity calculator MATTR)

In the last couple of days, I've been implementing various improvements to the Lexical Diversity Calculator. Not only did I fix a problem in the calculation of MTLD, which resulted in numbers that were slightly off, but I've also streamlined the calculations and added the calculation of Moving Average TTR (MATTR).

Photo by Siora Photography on Unsplash.

Updates

2025-05-29: Added the choice between natural logarithm and base 10 in the calculation of Maas's a2, Dugast's U2, and Herdan's C.
2025-05-29: Various improvements to calculations and algorithms; added MATTR.
2025-05-26: Important change to the calculation of MTLD, which was slightly off before because the results of the forward and backward algorithm were not averaged.
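
For readers wondering what averaging the forward and backward algorithm amounts to, here is a rough Python sketch of MTLD. It is not the calculator's actual code. The threshold of 0.72 is the value commonly used for MTLD; the function names, the use of 'less than or equal' at the threshold (some implementations use strictly 'less than') and the handling of texts that never produce a factor are assumptions made for this example.

```python
def mtld_one_direction(tokens, threshold=0.72):
    """MTLD for one pass through the tokens, in the order given."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        # Assumption: a factor is complete once the running TTR is at or
        # below the threshold; the running counts are then reset.
        if len(types) / count <= threshold:
            factors += 1
            types.clear()
            count = 0
    if count > 0:
        # Partial factor for the leftover stretch at the end of the pass.
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    if factors == 0:
        # Assumption: treat the score as undefined when no factor
        # (not even a partial one) was found.
        return float("nan")
    return len(tokens) / factors


def mtld(tokens, threshold=0.72):
    """Average of the forward pass and the pass over the reversed text."""
    forward = mtld_one_direction(tokens, threshold)
    backward = mtld_one_direction(list(reversed(tokens)), threshold)
    return (forward + backward) / 2


if __name__ == "__main__":
    tokens = "a b a a a b c".split()
    print(mtld_one_direction(tokens))
    print(mtld_one_direction(list(reversed(tokens))))
    print(mtld(tokens))
```

The example shows why the averaging matters: the two passes split the same text into factors at different places, so a single pass can give a noticeably different score than the mean of both.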

In addition, I'm working on an R package to easily calculate several measures of lexical diversity, primarily for a research project I'm envisioning for the near future. Stay tuned! For now, please see the online calculator at https://www.reuneker.nl/ld for the newest version.

Hapax Legomena added to Lexical Diversity tool

03 March 2025 in Taal & Literatuur by Alex (Hapax Legomena Lexical Diversity)

While mailing back and forth with one of the researchers over at the Max Planck Institute, there was some confusion over the use of the term unique words in the Lexical Diversity tool. Unique words are not hapax legomena, which is the term in corpus linguistics for words that occur only once. Unique words are simply types, and together they add up to the number of different words in a text. A word might occur once, twice or twenty times, but in all three cases, it would count as one unique word. This measure is also used for calculating the type-token ratio. As the researcher was interested in how many words occur only once in a text, I've added this count. You can use the new feature here right away!
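
The difference between the two counts is easy to see in a small Python sketch. The function name and the whitespace tokenization in the example are assumptions made for illustration; the tool itself may tokenize differently.

```python
from collections import Counter


def types_and_hapaxes(tokens):
    """Return (number of types, number of hapax legomena, type-token ratio)."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    # Types: every different word counts once, however often it occurs.
    n_types = len(counts)
    # Hapax legomena: only the words that occur exactly once.
    n_hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return n_types, n_hapaxes, n_types / len(tokens)


if __name__ == "__main__":
    tokens = "the cat saw the dog and the bird".split()
    print(types_and_hapaxes(tokens))
```

In the example sentence, 'the' occurs three times: it counts as one type, but it is not a hapax legomenon, so the sentence has six types and five hapax legomena.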

Hapax legomena in the Lexical Diversity tool