Website of Alex Reuneker about language, running, cycling and travel

N-gram generator now capable of logographic n-grams

By request, I have added an option to the n-gram generator to treat text not only as word-based, but also as character-based (logographic) text. This makes it possible to analyze texts in, for instance, Mandarin Chinese.

Logographic n-grams in the n-gram generator

As my knowledge of character-based languages is virtually non-existent, the feature is rudimentary at the moment, although two researchers of Mandarin Chinese did test the tool and evaluate the output. One limitation for Chinese, pointed out to me by Maarten Bogaards, is that the character-based script does not use spaces, so the generator treats each individual character as one word, even though words can consist of multiple characters. A Chinese sample text can now also be loaded for testing: an excerpt from Mei Yuan's 隨園詩話, taken from Project Gutenberg.
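To make the "one character equals one word" behaviour concrete, here is a minimal Python sketch. It is not the generator's actual code: the function name `logographic_ngrams` and the decision to drop whitespace are assumptions made for illustration only.

```python
from collections import Counter


def logographic_ngrams(text: str, n: int = 2) -> Counter:
    """Count n-grams in which every non-whitespace character is one token."""
    # Assumption: whitespace carries no meaning in this mode, so it is dropped.
    tokens = [ch for ch in text if not ch.isspace()]
    # Slide a window of n tokens over the sequence and join each window.
    grams = ("".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


# Bigrams over a short string; "隨園" occurs twice in it.
counts = logographic_ngrams("隨園詩話隨園", n=2)
print(counts.most_common(3))
```

Because no word segmentation takes place, a multi-character word only shows up as an n-gram when n happens to match its length, which is exactly the limitation described above.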

One of the things I added after testing was the removal of the following punctuation marks, which are different in Chinese, namely 。,、“‘’”《》…·:?!;() and ,. You might not see the difference in all marks, but they are non-ASCII (often full-width) counterparts of the familiar marks, which, for computers at least, are a different beast. You can also enter additional characters to exclude if you so wish.
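The difference between look-alike marks can be shown in a few lines of Python. This is only a sketch: the constant `CJK_MARKS`, the function `strip_marks` and the exact set of code points are my assumptions, not necessarily what the generator uses. The marks are written as Unicode escapes so the full-width forms cannot be confused with their ASCII twins.

```python
# Assumed set of marks to remove, written as escapes to keep them unambiguous.
CJK_MARKS = (
    "\u3002\u3001"              # ideographic full stop and comma
    "\uff0c\uff1a\uff1b"        # full-width comma, colon, semicolon
    "\uff1f\uff01"              # full-width question and exclamation mark
    "\uff08\uff09"              # full-width parentheses
    "\u300a\u300b"              # double angle brackets
    "\u201c\u201d\u2018\u2019"  # curly double and single quotes
    "\u2026\u00b7"              # ellipsis and middle dot
)


def strip_marks(text: str, extra: str = "") -> str:
    """Remove the assumed CJK marks plus any user-supplied characters."""
    # str.maketrans with a third argument maps those characters to None.
    table = str.maketrans("", "", CJK_MARKS + extra)
    return text.translate(table)


# The full-width comma is a different code point from the ASCII comma.
print(hex(ord("\uff0c")), hex(ord(",")))
print(strip_marks("\u4f60\u597d\uff0c\u4e16\u754c\u3002"))
```

The `extra` parameter mirrors the option to enter additional characters to exclude; the ASCII comma is left untouched unless it is passed in explicitly.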

Bug fixes and new feature n-gram generator

Unfortunately, due to work on large-file loading, some bugs slipped in, causing the n-gram generator to present incorrect results. Luckily, one of the users alerted me to this problem, and over the last few days I have fixed a number of related bugs. On top of that, I have implemented a number of checks to prevent seriously incorrect results in the future.

Finally, I have added an option to remove possessive 's, so now you can choose whether you’d like ‘Harry’s’ to be counted as ‘Harrys’ or ‘Harry’. Some general statistics (word totals, type-token ratio or TTR) were added too.
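As a rough illustration of the two choices for possessive 's and of the type-token ratio, consider the Python sketch below. The names `tokenize` and `type_token_ratio`, the lowercasing and the simple token pattern are all assumptions for the sake of the example, not the generator's own implementation.

```python
import re


def tokenize(text: str, drop_possessive: bool = True) -> list[str]:
    """Split text into lowercase word tokens, handling possessive 's."""
    # Treat the curly apostrophe and the straight one alike.
    text = text.replace("\u2019", "'").lower()
    if drop_possessive:
        # "harry's" -> "harry". Note: a contraction such as "it's" is hit too.
        text = re.sub(r"'s\b", "", text)
    # Any apostrophe still present is dropped: "harry's" -> "harrys".
    text = text.replace("'", "")
    return re.findall(r"[a-z0-9]+", text)


def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = "Harry\u2019s owl saw Harry"
print(tokenize(sample))                         # possessive 's removed
print(tokenize(sample, drop_possessive=False))  # only the apostrophe removed
print(type_token_ratio(tokenize(sample)))
```

With the possessive removed, both mentions of Harry count as the same type, which lowers the TTR compared with counting ‘Harrys’ separately.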

To try the new version, head over to https://www.reuneker.nl/files/ngram.

Updates for the N-gram generator

Once in a while I receive emails from researchers all over the world with thanks and/or suggestions for the scripts I provide online, such as the frequency list and n-gram generators. For the latter tool, I had a nice email conversation with a researcher from overseas, which led to the following enhancements and updates. I really enjoy these kinds of things, so if you have any suggestions or feedback, you know where to find me.

  • Slight efficiency rewrite of output rendering. (2024-01-26)
  • Added feature for respecting or ignoring sentence boundaries. (2024-01-25)
  • Added feature for including or excluding numbers. (2024-01-25)
  • Added top limits above 1,000 (2,000, 3,000, 4,000, 5,000, 10,000). (2024-01-25)
  • Added feature for (virtually) unlimited results. (2024-01-22)
  • Added feature for unigrams. (2024-01-22)
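
The options in the list above can be pictured with a small Python sketch. It is a simplified stand-in, not the tool's code: the function `ngrams`, its parameter names, the sentence split on ., ! and ?, and the choice to drop numbers before building n-grams are all assumptions.

```python
import re
from collections import Counter


def ngrams(text, n=2, respect_sentences=True, include_numbers=True, top=None):
    """Return (n-gram, count) pairs, most frequent first."""
    # Assumption: a sentence ends at '.', '!' or '?'.
    chunks = re.split(r"[.!?]+", text) if respect_sentences else [text]
    counts = Counter()
    for chunk in chunks:
        tokens = re.findall(r"\w+", chunk.lower())
        if not include_numbers:
            # Assumption: numbers are removed before n-grams are formed.
            tokens = [t for t in tokens if not t.isdigit()]
        # n-grams never span two chunks, so boundaries are respected.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # top=None returns every n-gram, i.e. an unlimited result list.
    return counts.most_common(top)


text = "I run. I cycle. I run 10 km."
print(ngrams(text, n=2, top=3))                    # sentence-bound bigrams
print(ngrams(text, n=2, respect_sentences=False))  # bigrams may cross sentences
print(ngrams(text, n=1, top=1))                    # unigrams, top result only
```

Setting `n=1` gives plain word frequencies (unigrams), and ignoring sentence boundaries adds n-grams such as "run i" that straddle two sentences.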