A better Danish wordlist, take one: word extraction

The texts I chose to base my work on are just under 4 GB in size and split across 25,106 files. These plain-text files contain markup that allows them to be used in concordances, where you can see multiple instances of each word with its immediate context. Further markup includes word classes, frequencies and other annotations that are of no interest to us.

Our initial text sources include (source: dsl.dk):

Korpus 90 – 32 million tokens of written Danish LGP gathered around 1990, ePOS-tagged and lemmatized
Korpus 2000 – 30 million tokens of written Danish LGP gathered around 2000, ePOS-tagged and lemmatized
Korpus 2010 – 45 million tokens of written Danish LGP gathered around 2010 as part of the DK-CLARIN Project, ePOS-tagged and lemmatized
ePAROLE – beta version of the Danish PAROLE corpus tagged with the ePOS tag set.
10000 most frequently used lemmas in Danish
Full-form lexicon: lemmas with inflected forms
Synonyms from The Danish Dictionary
The Danish FrameNet Lexicon
The Danish WordNet DanNet

In total, that is over 100 million (non-unique) Danish words spanning a good part of 30 years. Given the different file formats and the markup used for lexical analysis, cleaning up the text and validating the output is non-trivial.

A few examples from the source datasets can be seen below:

kedelige	kedelige	_	kedelig	A	AC:sdu#:--:p---
pampaskat	pampaskat	.$	pampaskat	N	NC:siuc:--:----
</>

T	på	0.015317332743123

1871@{møblement_1; møblering_1}@de møbler som et rum el. flere rum er udstyret med (Brug: "Karnappens møblement er let og raffineret || stuerne havde ligget en suite, adskilt af fløjdøre og med møblementet placeret langs væggene"; "Eneste møblering er nogle røde kontorstole")@Furniture+Artifact+Object+Group@
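
To give an idea of what handling even one of these formats involves, the tab-separated token lines in the first sample could be split into fields roughly like this. This is only a sketch: the column meanings (surface form, lemma, word class, morphological tag) are inferred from the two sample lines above rather than from any format documentation, and `Token` and `parse_token_line` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    form: str        # surface form as it appears in the text (assumed: column 1)
    lemma: str       # dictionary form (assumed: column 4)
    word_class: str  # coarse word class such as A or N (assumed: column 5)
    tag: str         # detailed morphological tag (assumed: column 6)


def parse_token_line(line: str) -> Optional[Token]:
    """Return a Token for a six-column tab-separated line, else None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return None  # markup lines such as "</>" or lines from another format
    # Columns 2 and 3 are not interpreted here; their meaning is unknown.
    form, _second, _third, lemma, word_class, tag = fields
    return Token(form, lemma, word_class, tag)


sample = "kedelige\tkedelige\t_\tkedelig\tA\tAC:sdu#:--:p---"
print(parse_token_line(sample))
print(parse_token_line("</>"))
```

A format-aware parser like this would have to be written and checked for each source, which is why the script further down takes the cruder regular-expression route.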

Using a crude Python script, the Counter class from Python’s collections module and regular expressions, I have extracted a first rough wordlist. The script is modified from a Stack Overflow post I didn’t bookmark. (I’d be happy to update this article with a link if someone takes the time to find it.)

from collections import Counter
from glob import iglob
import re
import os

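def remove_garbage(text):
    # Replace every run of non-word characters and underscores with one space.
    # \w is Unicode-aware in Python 3, so letters such as æ, ø and å survive.
    # Note: \W is tried first, so the third alternative never gets to match
    # and digit sequences are kept as tokens.
    text = re.sub(r'(\W|_|(\W\d+(?!\w)))+', ' ', text)
    text = text.lower()
    return text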

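topwords = 100000000  # upper bound on the number of words written out
folderpath = 'C:/data'
counter = Counter()
# Only .txt files directly inside folderpath are read (no subfolders).
for filepath in iglob(os.path.join(folderpath, '*.txt')):
    with open(filepath, encoding="utf-8") as file:
        # Each file is read into memory in full, cleaned, and split on whitespace.
        counter.update(remove_garbage(file.read()).split())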

# Write the words in descending order of frequency, with and without counts.
with open("C:/output/Danish-unique-words-with-frequency.txt", "w", encoding="utf-8") as f1, \
     open("C:/output/Danish-unique-words.txt", "w", encoding="utf-8") as f2:
    for word, count in counter.most_common(topwords):
        f1.write('{}:{}\n'.format(word, count))
        f2.write('{}\n'.format(word))

Is this optimal and performant? Heck no… Does it work? To some extent…
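
One easy improvement would be to stop reading whole files into memory. The sketch below uses the same regular expression as the script above, compiled once, and feeds the counter one line at a time. Since a newline is itself a non-word character, this should give the same counts. `GARBAGE` and `count_words` are hypothetical names, and I have not benchmarked this against the original.

```python
from collections import Counter
import re

# Same pattern as in the script above, compiled once instead of per call.
GARBAGE = re.compile(r'(\W|_|(\W\d+(?!\w)))+')


def count_words(lines):
    """Count lowercased tokens from any iterable of text lines."""
    counter = Counter()
    for line in lines:
        counter.update(GARBAGE.sub(' ', line).lower().split())
    return counter


# Usage with a file: iterating a file object yields one line at a time.
# with open(filepath, encoding="utf-8") as file:
#     counter.update(count_words(file))

demo = ["kedelige\tkedelige\t_\tkedelig\tA\tAC:sdu#:--:p---\n", "</>\n"]
print(count_words(demo).most_common(3))
```

The demo also shows how fragments of the tag column (`ac`, `sdu`, `p`) end up being counted as words.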

The output of the script is two text files:

One with only the extracted words.
One that includes an occurrence count for each word that may be useful for more optimised lists in the future.
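
The frequency file uses one `word:count` pair per line, so it is easy to read back in. As a sketch of how the counts might be used later, the hypothetical helper below keeps only words seen at least `min_count` times. The threshold and the demo counts are made up for illustration.

```python
import io


def read_frequencies(lines, min_count=1):
    """Yield (word, count) pairs from 'word:count' lines at or above min_count."""
    for line in lines:
        word, sep, count = line.rstrip("\n").rpartition(":")
        if not sep or not count.isdecimal():
            continue  # skip lines that do not look like word:count
        if int(count) >= min_count:
            yield word, int(count)


# Stand-in for the frequency file; these counts are invented for the example.
demo_file = io.StringIO("og:3\nkedelige:2\npampaskat:1\n")
print([word for word, _ in read_frequencies(demo_file, min_count=2)])
```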

The resulting wordlists still include some unfiltered single-letter results, English words, and terms and keywords from the markup used in the source files. However, the wordlist boasts 1,807,805 unique words in various forms and conjugations and is the biggest freely available wordlist for Danish I have been able to find.
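
A first clean-up pass could be as simple as the heuristic sketched below, which drops very short tokens and anything that is not purely alphabetic. This has not been applied to the published lists. `looks_like_word` and its `min_length` parameter are hypothetical, and the token list is hand-picked from the samples earlier in this post.

```python
def looks_like_word(token, min_length=2):
    """Heuristic: keep tokens that are purely alphabetic and long enough."""
    return len(token) >= min_length and token.isalpha()


tokens = ["kedelige", "kedelig", "a", "ac", "sdu", "p", "1871", "møblement"]
print([t for t in tokens if looks_like_word(t)])
```

Note the limits of this approach: markup fragments such as `ac` and `sdu` still pass, while genuine one-letter Danish words such as "i" and "å" are thrown away. A stop list of known tag names would probably be needed on top.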

I have shared the initial wordlists in a dedicated GitHub repository, where I hope to add improved lists, hashcat rules and masks, and other related tools in the future.

If you have any ideas for improvement or want to collaborate on creating Danish wordlists and methodology, please get in touch on GitHub, Twitter or email.