corpus - Alex Reuneker

SoNaR-500 reference corpus added to the Keyword Analysis tool

Yesterday and today I was working on a word frequency list of the SoNaR corpus (Oostdijk et al., 2013). I need that list to extend the Lexical Diversity tool, but I have already added SoNaR as a reference corpus to the Keyword Analysis tool.

So you can now choose to detect keywords in your (Dutch) text by comparing it with either the rather old and much smaller CONDIV corpus or the SoNaR corpus. Other available reference corpora are the BNC for (British) English, a Dutch-language pop corpus and a Dutch-language rap corpus. You can of course still add a reference corpus of your own, which is easier than you might think!

In the image below you can see that, for example, the word herkomstlanden ('countries of origin') occurs significantly more often in the NOS article Onderzoek: deel collectie Oranjes mogelijk onrechtmatig verkregen than in the SoNaR corpus and therefore says something about the article; it is a keyword (trefwoord).

Keywords in comparison with the SoNaR corpus

Two remarks on this addition: only Dutch newspaper texts were used for the frequency list and, with processing in JavaScript and file sizes in mind, only words that occurred ten times or more were included.
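To illustrate the frequency threshold, here is a minimal Python sketch of building a word frequency list and keeping only words that occur at least ten times. It is not the script used for the SoNaR list; the function name `frequency_list`, the lower-casing and the simple tokenisation are assumptions made for this example.

```python
import re
from collections import Counter


def frequency_list(text: str, min_count: int = 10) -> dict[str, int]:
    """Count word tokens and keep those occurring at least min_count times."""
    # Assumed tokenisation: lower-cased runs of letters/digits, optionally
    # joined by hyphens or apostrophes (e.g. "type-token", "zo'n").
    tokens = re.findall(r"[^\W_]+(?:[-'][^\W_]+)*", text.lower())
    counts = Counter(tokens)
    # Dropping rare words keeps the list small enough to ship to a browser.
    return {word: n for word, n in counts.items() if n >= min_count}
```

With the default `min_count=10`, a word that occurs nine times in the input is left out of the resulting list.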

You can of course use the extended tool right away at https://www.reuneker.nl/files/keyword.

Dealing with 'zero counts' in keyword analysis

30 December 2025 in Taal & Literatuur
keyword analysis zero counts target reference corpus

One problem with keyword analysis is that the target corpus will likely include words that do not occur in the reference corpus. In calculating various measures of keyness, this would result in a division by zero, which is undefined. The default way of dealing with this is to assign words that do not occur in the reference corpus a frequency of 0.5, but this carries the risk that such keywords dominate the top positions, because their keyness is inflated.

To remedy this problem, I have added an option to the Keyword Analysis Tool which lets you choose either to go with the default of assigning a frequency of 0.5 to 'zero counts', or to simply discard them from all calculations, resulting in keywords that have a minimum frequency of 1 in the reference corpus.
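The two options can be sketched in Python as follows. This is an illustration under assumptions, not the tool's own code (the tool runs in JavaScript): it uses a simple log ratio of relative frequencies as the keyness measure, and the names `log_ratio`, `keywords` and the `zero_counts` parameter are hypothetical.

```python
import math


def log_ratio(target_freq, target_size, ref_freq, ref_size):
    """Binary log of the ratio of relative frequencies (target vs. reference)."""
    return math.log2((target_freq / target_size) / (ref_freq / ref_size))


def keywords(target, reference, zero_counts="smooth"):
    """Rank target words by keyness.

    zero_counts="smooth":  words absent from the reference get frequency 0.5.
    zero_counts="discard": such words are left out of the calculation.
    """
    if zero_counts not in ("smooth", "discard"):
        raise ValueError("zero_counts must be 'smooth' or 'discard'")
    target_size = sum(target.values())
    ref_size = sum(reference.values())
    scores = {}
    for word, freq in target.items():
        ref_freq = reference.get(word, 0)
        if ref_freq == 0:
            if zero_counts == "discard":
                continue  # skip words that never occur in the reference
            ref_freq = 0.5  # avoids dividing by zero, but inflates keyness
        scores[word] = log_ratio(freq, target_size, ref_freq, ref_size)
    # Highest keyness first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With made-up counts such as `target = {"herkomstlanden": 4, "de": 40, "collectie": 6}` and `reference = {"de": 60000, "collectie": 50}`, the "smooth" option puts herkomstlanden at the top of the list, while "discard" leaves only the words that also occur in the reference corpus.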

Dealing with 'zero counts' in the Keyword Analysis Tool

There is no real wrong or right way to do this, but at least now you have a choice. Have fun!

'From clue to culprit: epistemic conditionals in detective fiction' at ECA 2025, Warsaw

29 April 2025 in Taal & Literatuur
reasoning argumentation conditionals corpus detectives sherlock holmes

In September 2025, at the 5th European Conference on Argumentation (ECA), I will present a corpus study on reasoning with conditionals by detectives, entitled From clue to culprit: epistemic conditionals in detective fiction. The conference is hosted by the Warsaw University of Technology, Faculty of Administration and Social Sciences. Below you'll find the abstract. Hope to see you all there!

Abstract 'From clue to culprit: epistemic conditionals in detective fiction'

Random Text Sampler

20 March 2025 in Taal & Literatuur
text corpus sample random tool online

For a comparative study, it is sometimes useful to draw samples of a certain number of words from a text. Because that is typically one of those recurring chores that ends up costing me more time than expected every time, I made a small online tool for it.

Random text sampler

It would be a shame to keep that to myself, so anyone who wants to can go to https://www.reuneker.nl/randsamples, enter a text, select the desired number of samples and the sample size (in number of words), and conjure up the samples at the push of a button. You can also indicate that you want to see the type-token ratios and MTLD scores, per sample and for the text as a whole.
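For readers who prefer to script this, the idea can be sketched in a few lines of Python. This is not the tool's implementation: it assumes that a sample is a contiguous run of words starting at a random position, that tokens are split on whitespace, and the names `random_samples` and `type_token_ratio` are made up for this example. MTLD is left out to keep the sketch short.

```python
import random


def type_token_ratio(tokens):
    """Number of distinct words (types) divided by number of words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def random_samples(text, n_samples, sample_size, seed=None):
    """Draw n_samples contiguous samples of sample_size words from text."""
    tokens = text.lower().split()  # assumed tokenisation: split on whitespace
    if sample_size < 1 or sample_size > len(tokens):
        raise ValueError("sample_size must be between 1 and the text length")
    rng = random.Random(seed)  # pass a seed to make the samples reproducible
    samples = []
    for _ in range(n_samples):
        # randint includes both ends, so the last possible window is allowed.
        start = rng.randint(0, len(tokens) - sample_size)
        samples.append(tokens[start:start + sample_size])
    return samples
```

Calling `random_samples(text, 5, 100)` and applying `type_token_ratio` to each result gives five samples of 100 words with their type-token ratios. Samples drawn this way may overlap.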

The concrete occasion, incidentally, was a small study of children's literature to illustrate the t-test calculator for students, which you can find here: https://www.reuneker.nl/2025/03/zinslengte-in-de-brief-voor-de-koning-en-kinderen-van-moeder-aarde. If you just want to see how it all works, you can easily take samples from Jules Verne's Twenty Thousand Leagues under the Sea or Louis Couperus' Stille Kracht, which you can load onto the screen with a click on the corresponding button.

New publication in Argumentation: Assessing Classification Reliability...

07 April 2023 in Taal & Literatuur
linguistics conditionals argumentation inter-rater reliability corpus classification research

Different types and argumentative uses of conditionals (if-then) have been distinguished in the literature, but their applicability to actual language use is rarely evaluated.

As 'the proof of the pudding is in the eating', my new paper in Argumentation (Springer) entitled 'Assessing Classification Reliability of Conditionals in Discourse' addresses this issue by means of an experiment in which the inter-rater reliability of classifications applied to natural-language corpora was assessed.

New publication in Argumentation: 'Assessing Classification Reliability of Conditionals in Discourse'

You can find the paper (open access) in Argumentation here: https://rdcu.be/c9nO4.