Datasets by MBM

These are language-related datasets that I have either created myself or compiled from other sources, and that I am making available under copyleft or open-source licenses.

Lemmatization Lists »

Machine-readable lists of lemma-token pairs in 23 languages.

Pota Focal House Glossary »

An Irish-English dictionary for learners of Irish with over 5,000 entries.

Irish Sentence Bank »

About 4,500 sentences in Irish, tokenized, manually lemmatized and translated into English.

Irish Word Frequency List »

About 6,500 Irish lemmas (= "words") ordered by corpus frequency, with noise removed.

WordNet in Microsoft SQL Server »

Princeton WordNet converted into a relational database format.