Alex Reuneker's website about language, running, cycling and travel

Lexical coverage added to Lexical Diversity Tool

I added a measure (somewhat) known as 'lexical coverage' to the Lexical Diversity Tool. It represents the percentage of words in a text that occur in a reference list: the words from all Dutch newspaper texts in the SoNaR-500 corpus that, together, account for 77 percent of all tokens in that corpus (other corpora have been used as well; see Staphorsius, 1994; Kraf, Lentz & Pander Maat, 2011). The higher this percentage, the easier the text is assumed to be, because more of its words can be supposed to have been read before and are thus 'known'. Although this says something, perhaps indirectly, about the lexical diversity of a text, the measure is used primarily to assess reading difficulty (see also Adolphs & Schmitt, 2003; Van Zeeland & Schmitt, 2013).
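
To make the idea concrete, here is a minimal Python sketch of how such a measure could be computed. It is not the tool's actual code. The function names, the toy corpus and the simple letter-based tokenizer are all assumptions for illustration, as is the choice to build the reference list from the most frequent words and to count coverage over tokens rather than types.

```python
import re
from collections import Counter


def build_reference_list(corpus_tokens, target=0.77):
    """Collect the most frequent words until they cover `target` of all tokens.

    Assumption: the reference list is filled in order of descending frequency.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    covered = 0
    reference = set()
    for word, freq in counts.most_common():
        if covered / total >= target:
            break
        reference.add(word)
        covered += freq
    return reference


def lexical_coverage(text, reference):
    """Return the percentage of tokens in `text` that occur in `reference`."""
    # Assumption: a token is any run of letters, compared in lowercase.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in reference)
    return 100.0 * known / len(tokens)


# Hypothetical toy corpus standing in for the SoNaR-500 newspaper texts.
toy_corpus = ["de"] * 5 + ["kat"] * 3 + ["huis"] + ["burgemeester"]
toy_reference = build_reference_list(toy_corpus)

# Hypothetical hand-made word list, only to show the coverage calculation.
word_list = {"de", "het", "een", "is", "en", "van", "kat", "huis"}
sentence = "De kat is in het huis van de burgemeester."
print(round(lexical_coverage(sentence, word_list), 1))
```

In the example sentence, seven of the nine tokens are on the hand-made list, so the coverage is roughly 78 percent. With a real reference list, a lower value would suggest a text with more uncommon words.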

Because I used the (Dutch newspaper subcorpus of the) SoNaR-500 as a reference corpus, the measure only works for Dutch texts, for now at least. The implementation is workable and correct, but be aware that it is still a bit rough and in development.

Lempel-Ziv-Welch algorithm in the Lexical Diversity Calculator

In a previous post about compression ratios, I mentioned that, after a general compression ratio, I also wanted to implement the compression ratio based on the Lempel-Ziv-Welch algorithm, as used in the research that this Nature article by Parada-Cabaleiro et al. (2024) is about. That has now been done: the Lexical Diversity Calculator now also calculates the LZW compression ratio.
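
For readers who want to see what such a calculation could look like, here is a minimal Python sketch of textbook LZW compression with a ratio on top. It is not the calculator's actual code. The function names are hypothetical, and the ratio definition (number of output codes divided by number of input bytes) is an assumption; the calculator may define the ratio differently, for example in bits.

```python
def lzw_compress(data):
    """Compress a bytes object into a list of integer codes with basic LZW."""
    # The dictionary starts with one entry for every possible single byte.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            # Keep extending the current sequence while it is already known.
            current = candidate
        else:
            # Emit the code of the longest known sequence, then learn the new one.
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_ratio(text):
    """Assumed definition: output codes divided by input bytes (lower = more repetitive)."""
    data = text.encode("utf-8")
    if not data:
        return 0.0
    return len(lzw_compress(data)) / len(data)


example = "TOBEORNOTTOBEORTOBEORNOT"
print(lzw_compress(example.encode("utf-8")))
print(lzw_ratio(example))
```

The example string has 24 characters and compresses to 16 codes, because repeated sequences such as "TO" and "BE" are replaced by a single code once they have been seen. More repetitive texts give a lower ratio under this definition.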

Welch (1984)

You can read more about the algorithm itself on Geeks for Geeks or, if you really feel like it, in the original article by Welch (1984). Of course, you can try the function right away with the Lexical Diversity Calculator.