N-gram generator

You can use this page to generate n-grams for your texts. This site was made by Alex Reuneker. For a short demonstration and how-to, see https://youtu.be/4aRD_NwuHW4. For questions, see the contact details at https://www.reuneker.nl. If you use this site for your research, please cite it as follows:

Reuneker, A. (2019). N-gram generator. Retrieved ..., from https://www.reuneker.nl/ngram.


Steps

  1. Copy a text (from a website, a book, a larger corpus, et cetera).
  2. Paste the text into the input area below. You don't have to remove unusual characters, tags, extra white space or new lines; the script does that for you.
  3. Set 'ngram' to the desired number of words, or leave it at 2 (bigrams), and set the number of results wanted (or leave it at 50). If you're going to sort on probability (see 'explanation'), it can be useful to set a minimum frequency for the n-grams included in the list. Sentence boundaries are respected by default, meaning that a sequence like '[...] list. Sentence [...]' above is not counted as an n-gram. If you choose to ignore sentence boundaries, the example would count as the bigram 'list sentence'.
  4. Click 'Generate ngrams' and wait a bit.
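
The steps above can be sketched in a few lines of code. This is not the site's own script (which is written in JavaScript and is not shown on this page) but a minimal Python illustration under stated assumptions: sentences are split naively on '.', '!' or '?' followed by white space, tokens are lowercased runs of word characters, and the names ngrams and top_ngrams are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive split: a sentence ends at '.', '!' or '?' followed by white space."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercase the text and keep runs of word characters (apostrophes allowed inside)."""
    return re.findall(r"\w+(?:'\w+)*", text.lower())


def ngrams(text, n=2, respect_sentences=True):
    """Count n-grams, optionally without crossing sentence boundaries."""
    text = re.sub(r"<[^>]+>", " ", text)  # assumption: anything in <...> is a tag
    chunks = split_sentences(text) if respect_sentences else [text]
    counts = Counter()
    for chunk in chunks:
        tokens = tokenize(chunk)  # new lines and extra spaces vanish here
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def top_ngrams(counts, limit=50, min_freq=1):
    """Keep n-grams with at least min_freq occurrences, most frequent first."""
    kept = [(gram, c) for gram, c in counts.items() if c >= min_freq]
    return sorted(kept, key=lambda item: (-item[1], item[0]))[:limit]


sample = "Keep the list. Sentence boundaries matter."
respected = ngrams(sample, n=2, respect_sentences=True)
ignored = ngrams(sample, n=2, respect_sentences=False)
print("list sentence" in respected)  # boundary respected: bigram is absent
print("list sentence" in ignored)    # boundary ignored: bigram is counted
```

With sentence boundaries respected, the bigram 'list sentence' never appears, because each sentence is tokenized on its own; ignoring boundaries treats the whole text as one token stream.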

Input and settings

Choose preferred settings, or leave at default.

Paste a text to analyze below.



Results

Results will be presented here after you click 'generate n-grams'... Please realise that large texts may take a while and may cause your browser to slow down. In such cases, it is wise not to set 'results per page' to 'unlimited'.

About

The n-gram function was written in vanilla JavaScript, and your text is not uploaded to any server. Your computer itself does all the work. Small texts are processed very quickly. Longer texts take a bit longer, although getting all bigrams from the King James Bible (4.2 MB; almost 800,000 words) took my laptop three seconds with multiple tabs and other applications open. Please note that retrieving a long or (virtually) unlimited list of results may slow down or even crash your browser due to memory limitations of your computer and browser architecture.

Short explanation

An n-gram is just a sequence of n words. So, 'cat' is a unigram, 'my cat' is a bigram, and 'my sleepy cat' is a trigram. There are more interesting things to know about n-grams. For a short explanation of frequencies, probabilities and my (rather unusual) 'strength' measure, click here.
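
To make frequencies and probabilities concrete, here is a small Python sketch. It assumes the common textbook definitions, which may differ from the ones this site uses: relative frequency is an n-gram's count divided by the total number of n-grams, and conditional probability is the count of the n-gram divided by the count of all n-grams that share its first n-1 words. The function names are hypothetical, and the 'strength' measure is left out because its definition is not given on this page.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relative_frequency(tokens, n):
    """Share of each n-gram among all n-grams in the text."""
    counts = count_ngrams(tokens, n)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def conditional_probability(tokens, n):
    """Estimate P(last word | first n-1 words) from the counts alone."""
    counts = count_ngrams(tokens, n)
    prefix_totals = Counter()
    for gram, c in counts.items():
        prefix_totals[gram[:-1]] += c  # all n-grams starting with this prefix
    return {gram: c / prefix_totals[gram[:-1]] for gram, c in counts.items()}


tokens = "my sleepy cat and my cat".split()
freqs = relative_frequency(tokens, 2)
probs = conditional_probability(tokens, 2)
print(freqs[("my", "cat")])      # 1 of 5 bigrams
print(probs[("my", "cat")])      # 'my' is followed by 'cat' in 1 of 2 cases
print(probs[("sleepy", "cat")])  # 'sleepy' is always followed by 'cat'
```

In this toy text, 'my cat' is a rare bigram overall, yet 'cat' is fairly likely once 'my' has been seen. On small texts such estimates rest on very few observations, which is one reason a minimum frequency helps when sorting on probability.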

Updates

Update: Added option to exclude not only numbers, but also words containing numbers. (2024-01-29)
Update: Slight efficiency rewrite of output rendering. (2024-01-26)
Update: Added feature for respecting or ignoring sentence boundaries. (2024-01-25)
Update: Added feature for including or excluding numbers. (2024-01-25)
Update: Added top limits above 1,000 (2,000, 3,000, 4,000, 5,000, 10,000). (2024-01-25)
Update: Added feature for (virtually) unlimited results. (2024-01-22)
Update: Added feature for unigrams. (2024-01-22)
Update: Added feature to export results to CSV. (2022-08-31)
Update: Added feature to exclude words before processing text. (2022-08-31)
Update: Fixed newline problem resulting in n+1-grams. (2021-01-17)
Update: Added strength measure. (2019-12-11)
Update: Added conditional probabilities. (2019-12-10)