These are large-coverage, machine-readable lemma/token pairs in several languages which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use these for query expansion during full-text searches: if a user searches for the lemma "walk", the query is expanded to also search for the tokens "walking", "walked", etc.
These are plain text files (zipped). Each line contains one lemma/token pair separated by a tab character in this sequence: lemma, tab, token. The files are encoded in UTF-8 with Windows-style line breaks.
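As a rough illustration of how files in this format could be used for query expansion, here is a minimal Python sketch. The function names `load_pairs` and `expand`, the sample data and the file name in the comment are all made up for this example; this is not the code I actually run, only one way the lemma, tab, token layout might be read.

```python
from collections import defaultdict


def load_pairs(lines):
    """Build a lemma -> set-of-tokens index from 'lemma<TAB>token' lines."""
    index = defaultdict(set)
    for line in lines:
        # The files use Windows-style line breaks, so strip both \r and \n.
        line = line.rstrip("\r\n")
        if not line:
            continue
        lemma, sep, token = line.partition("\t")
        if not sep:
            continue  # no tab on this line, so it is not a lemma/token pair
        index[lemma].add(token)
    return index


def expand(index, lemma):
    """Return the lemma itself plus every token listed for it, sorted."""
    return sorted({lemma} | index.get(lemma, set()))


# Hypothetical sample standing in for an unzipped file. A real file could be
# read with: open("some-file.txt", encoding="utf-8", newline="")
sample = "walk\twalking\r\nwalk\twalked\r\nwalk\twalks\r\n"
index = load_pairs(sample.splitlines(keepends=True))

print(expand(index, "walk"))  # ['walk', 'walked', 'walking', 'walks']
```

A lemma that is not in the index is simply returned on its own, so the search falls back to the user's original term.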
18 July 2016
- Irish (ga) updated: lots more lemma/token pairs added.
- Manx Gaelic (gv) added.
The data comes from the following sources:
- Various Hunspell dictionaries from the OpenOffice.org website
- Deutsches Morphologie-Lexikon by Daniel Naber
- Lexique by Boris New and Christophe Pallier
- e_lemma.txt by Yasumasa Someya
- Multext East (only those morphological lexicons that are under a free licence are used)
- Morphological dictionaries from FreeLing
- SALDO morphological lexicon
- Irish National Morphology Database
- Various lists by Kevin Scannell
If I seem to have forgotten anybody, please remind me.