Datasets by MBM

Lemmatization Lists

These are large-coverage, machine-readable lists of lemma/token pairs in several languages, which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion in full-text searches: if a user searches for the lemma "walk", the query is expanded to also search for the tokens "walking", "walked", etc.
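
As a rough illustration of the query expansion described above, here is a minimal Python sketch. The function name `expand_query` and the tiny hand-written mapping are assumptions made for illustration only; this is not the code used in any actual search system.

```python
def expand_query(term, lemma_to_tokens):
    """Return the search term plus every token listed under it as a lemma."""
    expanded = [term]
    for token in lemma_to_tokens.get(term, []):
        if token not in expanded:
            expanded.append(token)
    return expanded


# Hypothetical sample data; the real lists contain many more pairs.
sample = {"walk": ["walks", "walking", "walked"]}

print(expand_query("walk", sample))
print(expand_query("glossary", sample))
```

A term that does not appear as a lemma is returned unchanged, so the search still runs on the original word.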

These are zipped plain-text files. Each line contains one lemma/token pair in this order: lemma, tab character, token. The files are encoded in UTF-8 with Windows-style (CRLF) line breaks.
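
The format described above could be read along these lines. This is a minimal Python sketch under stated assumptions: the function name `load_lemma_pairs` is hypothetical, the sample data is invented, and skipping lines without a tab is my own choice for handling malformed input rather than something the files are known to need.

```python
import io
from collections import defaultdict


def load_lemma_pairs(lines):
    """Build a lemma -> [tokens] mapping from lines of 'lemma<TAB>token'."""
    lemma_to_tokens = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")  # drop the Windows-style line break
        if not line:
            continue  # skip blank lines
        lemma, sep, token = line.partition("\t")
        if not sep:
            continue  # assumption: ignore lines that contain no tab
        lemma_to_tokens[lemma].append(token)
    return dict(lemma_to_tokens)


# Hypothetical sample standing in for an unzipped file.
sample_file = io.StringIO("walk\twalking\r\nwalk\twalked\r\n", newline="")
pairs = load_lemma_pairs(sample_file)
print(pairs)
```

For a real unzipped file, the same function should work on a file object opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the file was extracted.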

Available under the Open Database License


18 July 2016

  • Irish (ga) updated: lots more lemma/token pairs added.
  • Manx Gaelic (gv) added.


If I seem to have forgotten anybody, please remind me.