Michal Měchura

Language technologist, information architect

Hello. I am the author of the open-source dictionary writing system Lexonomy and the open-source terminology management platform Terminologue. I have written the Irish-language book An Ríomhaire Ilteangach, a guide to language technology for general readers. I have built or co-built many Irish-language reference websites including the National Terminology Database for Irish, the Placenames Database of Ireland, the National Folklore Collection and the Dictionary and Language Library. I am the author of Xonomy, an open-source, browser-based XML editor. I have written a computational grammar of Irish called Gramadán and I maintain the Irish National Morphology Database.
Fiontar & Scoil na Gaeilge, Dublin City University, Ireland
Foras na Gaeilge, Dublin, Ireland
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
Dioplóma Iarchéime sa Ghaeilge Fheidhmeach | Postgraduate Diploma in Applied Irish
Dublin Institute of Technology, 2010
MPhil in Speech and Language Processing
Trinity College, University of Dublin, 2008
Fairslator (fairslator.com), Patnáct vět (15vet.cz), Pota Focal (potafocal.com), Intergaelic (intergaelic.com)

Publications & talks



Fairslator Demo BIB

EVENT Text, Speech and Dialogue Conference, Brno
This demo introduced Fairslator, an experimental application for removing bias from machine translation. Translations produced by machines – especially when the source language is English – are often biased because of ambiguities in gender, number and forms of address. Fairslator resolves these by examining the output of machine translation, detecting the presence of any bias-triggering ambiguities, and asking the human user how they wish to resolve them: for example, whether gender-ambiguous English words such as ‘student’ and ‘doctor’ should be translated as male or female, or whether the English pronoun ‘you’ should be translated as singular or plural, as formal or informal.

Document or database? The search for the perfect storage paradigm for lexical data. BIB

EVENT Euralex 2022 Conference, Mannheim, Germany

A taxonomy of bias-causing ambiguities in machine translation BIB

PUBLISHED IN Proceedings of the 4th workshop on gender bias in natural language processing (GeBNLP)
PUBLISHER Association for Computational Linguistics, Seattle, Washington
This paper introduces a taxonomy of phenomena which cause bias in machine translation, covering gender bias (people being male and/or female), number bias (singular you versus plural you) and formality bias (informal you versus formal you). Our taxonomy is a formalism for describing situations in machine translation when the source text leaves some of these properties unspecified (e.g. does not say whether doctor is male or female) but the target language requires the property to be specified (e.g. because it does not have a gender-neutral word for doctor). The formalism described here is used internally by a web-based tool we have built for detecting and correcting bias in the output of any machine translator.
CONFERENCE PAPER with Brian Ó Raghallaigh, Úna Bhreathnach and Gearóid Ó Cleircín

Dare to be different: how user needs determine termbase design BIB

EVENT Multilingual Digital Terminology Today: Design, representation formats and management systems, Padova, Italy

An introduction to lexicographic data modelling BIB

EVENT Lexicom, Telč, Czech Republic

DMLex, a data model for lexicography: an example-by-example introduction BIB

EVENT ELEXIS Showcase Event, Florence, Italy

What You Need to Know About Bias in Machine Translation BIB

As machine translation gets better, the problem of bias — especially gender bias — remains a source of embarrassment for the industry. Why MT bias matters and how major players are trying to fix it.

So you want to build a placenames database: an introduction to toponymic data modelling BIB

EVENT Placenames in Bilingual Areas Workshop, Dublin, Ireland

Ceardlann ar Terminologue BIB

EVENT An Ghaeilge agus an Téarmeolaíocht, Dublin, Ireland

We need to talk about bias in machine translation: the Fairslator whitepaper BIB

Machine translation is getting better all the time but the problem of bias still remains. Translations produced by machines are often biased because of ambiguities in gender, in forms of address, and in word meaning. This whitepaper analyzes the problem and proposes a solution based on automated re-inflection with humans in the loop.


TALK with Brian Ó Raghallaigh

Terminologue and open source terminology solutions BIB

EVENT European Association for Terminology Summit 2021, Online
TALK with Brian Ó Raghallaigh

Introducing Terminologue: a cloud-based, open-source terminology management tool BIB

EVENT XIX EURALEX International Congress, Online

Re‑inventing the phrasebook with rule‑based language technology BIB

EVENT Grammatical Framework Summer School, Singapore and online
An introduction to Czechslator and the technology behind it.

Lexicographic APIs: the state of the art BIB

EVENT eLex 2021 Conference
JOURNAL ARTICLE with Brian Ó Raghallaigh, Aengus Ó Fionnagáin and Sophie Osborne

Developing the Gaois Linguistic Database of Irish-language Surnames BIB

PUBLISHED IN Names: A Journal of Onomastics
In this paper, we introduce the first-ever open, data-driven linguistic database of Irish-language surnames, along with an algorithm for deriving inflected forms of Irish-language surnames.

A survey of dictionary APIs »

A survey of application programming interfaces (APIs) on the Internet which provide access to lexicographic content in machine-readable formats.



Contributions to e-lexicography BIB

INSTITUTION Masaryk University, Brno
This thesis is about the digitization of lexicography, with a focus on dictionaries intended for human users.

The future of dictionary editing BIB

EVENT Lexicom, Mikulov, Moravia



Plausibility filtering with Grammatical Framework BIB

This document describes a technique called plausibility filtering which you can use to prevent a Grammatical Framework (GF) application grammar from generating semantically implausible sentences.

Breaking the tyranny of machine translation BIB

EVENT Grammatical Framework Summer School, Stellenbosch, South Africa
CONFERENCE PAPER with Krasimir Angelov

Editing with Search and Exploration for Controlled Languages BIB

PUBLISHED IN Proceedings of the Sixth International Workshop on Controlled Natural Language
PUBLISHER IOS Press, Maynooth, Ireland
We present an editor for controlled languages which is a combination of a syntax editor and a predictive editor.

Shareable Subentries in Lexonomy as a Solution to the Problem of Multiword Item Placement BIB

EVENT EURALEX 2018, Ljubljana, Slovenia
PUBLISHED IN Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts
This paper introduces a new way of dealing with phraseology in dictionaries. A classical question in lexicography is whether multiword items such as third time lucky should be listed under third, time or lucky. The ideal answer is ‘under all of them’ but, until now, the only way to do that in conventional tree-structured dictionaries has been to keep multiple copies (of what conceptually is one and the same item) in several places throughout the dictionary. We present a way to achieve the same goal without copying. The multiword item becomes a semi-independent subentry which exists in only one copy but appears simultaneously in several places in the dictionary. The structure of the dictionary remains a tree but the lexicographer is empowered to occasionally ‘break out’ of the tree in order to avoid duplication. This paper explains the reasoning behind the concept of shareable subentries and shows how this new functionality has been implemented in the dictionary writing system Lexonomy.
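The idea of a subentry that exists once but appears under several headwords can be illustrated with a minimal sketch. This is not Lexonomy's actual implementation: the class names, the `mw1` identifier and the placeholder definitions are assumptions made up for illustration. Each entry stays an ordinary tree, but holds a reference to the shared item instead of a copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subentry:
    """A multiword item that is stored exactly once."""
    text: str
    definition: str


@dataclass
class Entry:
    """A headword entry that points to shared subentries by id instead of embedding copies."""
    headword: str
    subentry_ids: List[str] = field(default_factory=list)


# Hypothetical store: one copy of the multiword item, keyed by an invented id.
subentries: Dict[str, Subentry] = {
    "mw1": Subentry("third time lucky", "placeholder definition"),
}

# Three ordinary entries, each referring to the same subentry.
entries: Dict[str, Entry] = {
    hw: Entry(hw, ["mw1"]) for hw in ("third", "time", "lucky")
}


def render(entry: Entry) -> str:
    """Resolve references at display time, so the item appears under each headword."""
    lines = [entry.headword]
    for sid in entry.subentry_ids:
        sub = subentries[sid]
        lines.append(f"  {sub.text}: {sub.definition}")
    return "\n".join(lines)


# Editing the single stored copy is reflected under every headword that shares it.
subentries["mw1"].definition = "revised placeholder definition"
```

Calling `render(entries["time"])` or `render(entries["lucky"])` shows the revised definition in both places, although only one copy was edited.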
TALK with Miloš Jakubíček, Vojtěch Kovář and Pavel Rychlý

Practical Post-Editing Lexicography with Lexonomy and Sketch Engine BIB

EVENT XVIII EURALEX International Congress: Lexicography in Global Contexts



Introducing Lexonomy: an open-source dictionary writing and publishing system BIB

PUBLISHER Electronic lexicography in the 21st century: Proceedings of eLex 2017 conference, Leiden
This demo introduces Lexonomy (www.lexonomy.eu), a free, open-source, web-based dictionary writing and publishing system. In Lexonomy, users can take a dictionary project from initial set-up to final online publication in a completely self-service fashion, with no technical skills required and no financial cost.

How (not) to build a European Dictionary Portal BIB

EVENT Final Conference of the European Network of e-Lexicography, Leiden

Ar thairseach na haoise digití: mionteangacha agus an ríomhaireacht BIB

EVENT ‘Ar an Imeall i Lár an Domhain?’: An tairseachúlacht i litríocht agus i gcultúr na hÉireann agus na hEorpa, Prague

Towards a Metadata Infrastructure for Online Dictionaries BIB

EVENT European Network of e-Lexicography, Budapest
TALK with Miloš Jakubíček, Vojtěch Kovář and Pavel Rychlý

One-Click Dictionary BIB

EVENT Electronic lexicography in the 21st century (eLex) conference

An Ríomhaire Ilteangach BIB

PUBLISHER Cois Life, Dublin
ISBN 978-1-907494-70-3
Treoirleabhar don teicneolaíocht teanga atá dírithe ar an léitheoir ginearálta. Léitheoireacht riachtanach é seo do gach duine a láimhseálann breis is teanga amháin ar an ríomhaire. | A guide to language technology for general readers. This book is required reading for everybody who uses more than one language on their computer.


TALK with Brian Ó Raghallaigh and Katie Ní Loingsigh

Towards a database of Irish surnames BIB

EVENT 25th Spring Conference of the Society for Name Studies in Britain and Ireland

Things to think about when building a dictionary website BIB

EVENT European Network of e-Lexicography, Barcelona, Catalonia

Data Structures in Lexicography: from Trees to Graphs BIB

PUBLISHED IN Recent Advances in Slavonic Natural Language Processing
In lexicography, a dictionary entry is typically encoded in XML as a tree: a hierarchical data structure of parent-child relations where every element has at most one parent. This choice of data structure makes some aspects of the lexicographer’s work unnecessarily difficult, from deciding where to place multi-word items to reversing an entire bilingual dictionary. This paper proposes that these and other notorious areas of difficulty can be made easier by remodelling dictionaries as graphs rather than trees. However, unlike other authors who have proposed a radical departure from tree structures and whose proposals have remained largely unimplemented, this paper proposes a conservative compromise in which existing tree structures become augmented with specific types of inter-entry relations designed to solve specific problems.



Do minority languages need the same language technology as majority languages? BIB

EVENT British-Irish Council conference on language technology in indigenous, minority and lesser-used languages, Dublin Castle, Ireland

Do minority languages need machine translation? »

I want to bust the myth that machine translation is necessary for the revival of minority languages.



Irish National Morphology Database: a high-accuracy open-source dataset of Irish words BIB

PUBLISHED IN Proceedings of the First Celtic Language Technology Workshop
The Irish National Morphology Database is a human-verified, Official Standard-compliant dataset containing the inflected forms and other morphosyntactic properties of Irish nouns, adjectives, verbs and prepositions. It is being developed by Foras na Gaeilge as part of the New English-Irish Dictionary project. This paper introduces this dataset and its accompanying software library Gramadán.

10 reasons why Irish is an absolutely awesome language »

And these are proper linguistic reasons, too – none of that starry-eyed sentimental nonsense about the language being ‘beautiful’ or ‘romantic’.

Breathing new life into old data: how to retro-digitize a dictionary »

What I learned from a project where we retro-digitized two Irish dictionaries and published them on the web.



The linguistic relativity of up and down »

A nice and simple example of how learning a new language causes you to start perceiving the world differently.



Léacslann: a platform for building dictionary writing systems BIB

PUBLISHED IN Proceedings of the 15th Euralex International Congress
PUBLISHER University of Oslo, Oslo
The purpose of this demo is to introduce Léacslann, a new platform for building dictionary writing systems (DWS) and terminology management systems (TMS) as well as other lexicographic and reference applications. Léacslann can be used without any knowledge of programming to create a basic lexical database with an arbitrary structure. This will be demonstrated in the first half of the demo, while the second half will show how a software developer can customize Léacslann for more demanding applications.
TALK with Brian Ó Raghallaigh

The logainm.ie Placenames Database of Ireland: Software demonstration BIB

EVENT Placenames Workshop 2012

Idir foclóir agus léarscáil: Bunachar Logainmneacha na hÉireann BIB

EVENT Daonscoil na Mumhan, Waterford, Ireland

Léacslann Tutorial BIB

PUBLISHER Dublin City University



When definitions are not enough BIB

PUBLISHED IN Proceedings of Terminology and Knowledge Engineering (TKE) Conference
PUBLISHER Dublin City University
This paper introduces Compositional Term Diagrams (CTDs) as a formalism for analysing the structure of multi-word terms. CTDs have the potential to help terminologists resolve ambiguities related to transitivity (“who does what to whom”), modification (“what modifies what”) and evocation (“which sense is evoked by this word?”).
TALK with Brian Ó Raghallaigh

How to build a termbase for 500,000 users (and live to tell the story) BIB

EVENT Terminology and Knowledge Engineering (TKE) Conference, Dublin, Ireland

What WordNet does not know about selectional preferences BIB

PUBLISHED IN Proceedings of the 14th Euralex International Congress
PUBLISHER Fryske Akademy, Ljouwert/Leeuwarden
Selectional preferences are the tendencies of words to co-occur with other words that belong to certain semantic types. In this paper, I will investigate how closely these corpus-attested preferences correspond to WordNet. For example, for all possible direct objects of cancel, is there a single category (or a union of several categories) in WordNet that subsumes them, and only them? Selectional preferences manifest themselves in authentic texts and can be revealed through corpus analysis. I will introduce an experimental tool I have built which attempts to do this automatically by aligning corpus-extracted lists of collocates (for example a list of the direct objects of cancel) with WordNet. The strength of this method is that it can discover and name selectional preferences automatically, but its weakness is that it can only do so when WordNet contains a suitable category. We will see that WordNet often lacks a category (or even a union of several categories) that fully corresponds to an attested selectional preference – for example, there is no category in WordNet that includes all the kinds of events that can be direct objects of cancel (meeting, wedding, concert etc.) but excludes those that cannot (accident, sunset, invention etc.).
TALK with Brian Ó Raghallaigh

The Focal.ie National Terminology Database for Irish BIB

EVENT 14th Euralex International Congress, Ljouwert/Leeuwarden

Living with a diacritic »

No, this is not an article about living with an obscure illness. It’s an article about living with a name no-one can spell correctly.


TALK with Brian Ó Raghallaigh

User-Friendliness: the key to promoting a minority language on the Internet BIB

EVENT 12th International Conference on Minority Languages, Tartu, Estonia

Flags as language symbols – so what is the problem? »

Using country flags as if they were language symbols is bad. So why does everybody keep on doing it? And is it really so bad?

Linguistic relativity: fact or wishful thinking? »

Most linguists secretly wish the Sapir-Whorf Hypothesis to be true. But is it?



Giving them what they want: search strategies for electronic dictionaries BIB

PUBLISHED IN Proceedings of the 13th Euralex International Congress
PUBLISHER Universitat Pompeu Fabra, Barcelona
This paper deals with how humans search electronic dictionaries. It raises the point that users often make dictionary searches with misspellings, with inflected words copied and pasted from elsewhere, with complete sentences or fragments thereof, and with other kinds of low-quality input, and suggests methods for dealing with such phenomena in a pre-emptive manner. The issues addressed include searching with inflections, dealing with multi-word items, misspelling detection and text normalization. Additionally, the value of log files is emphasized as a source of information on user behaviour.
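A minimal sketch of how such a pre-emptive search pipeline might look, assuming a tiny invented headword list and inflection table. It is not the method described in the paper: the names `HEADWORDS`, `INFLECTED`, `normalize` and `search` are hypothetical, and the 0.8 similarity cutoff is an arbitrary choice. Python's standard `difflib.get_close_matches` stands in for misspelling detection.

```python
import difflib
import string
import unicodedata
from typing import Dict, List, Optional

# Hypothetical data: a real dictionary would load these from its database.
HEADWORDS: List[str] = ["cancel", "delicious", "marry"]
INFLECTED: Dict[str, str] = {
    "cancels": "cancel", "cancelled": "cancel",
    "marries": "marry", "married": "marry",
}


def normalize(query: str) -> str:
    """Text normalization: unify Unicode form and case, trim spaces and punctuation."""
    text = unicodedata.normalize("NFC", query).casefold()
    return text.strip(string.whitespace + string.punctuation)


def lookup(token: str) -> Optional[str]:
    """Return the headword for an exact or inflected form, or None."""
    if token in HEADWORDS:
        return token
    return INFLECTED.get(token)


def search(query: str) -> List[str]:
    """Try each word of the input first, then fall back to fuzzy matching."""
    q = normalize(query)
    hits: List[str] = []
    for token in q.split():  # handles pasted phrases and sentences
        match = lookup(normalize(token))
        if match and match not in hits:
            hits.append(match)
    if hits:
        return hits
    # Misspelling detection: suggest headwords that closely resemble the input.
    return difflib.get_close_matches(q, HEADWORDS, n=3, cutoff=0.8)
```

With this toy data, `search("  Cancelled. ")` resolves the inflected form to `cancel`, and `search("delicous")` falls through to the fuzzy step and suggests `delicious`.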

Cá bhfuil mo shínte fada? – ionchódú téacs ar ríomhairí BIB

EVENT Engineers Ireland, Dublin, Ireland

Selectional Preferences, Corpora and Ontologies BIB

INSTITUTION Trinity College, University of Dublin
This work presents a technique for exploring the selectional preferences of words in a semi-automatic way. The technique combines corpora with ontologies such as WordNet. The term selectional preference denotes a word’s tendency to co-occur with words that belong to certain lexical sets. For example, the adjective delicious prefers to modify nouns that denote food and the verb marry prefers subjects and objects that denote humans. This work develops techniques for associating corpus-attested selectional preferences with concepts in an ontology. It shows how lexical sets can be derived from ontologies and how corpus-extracted collocates of a word can then be aligned with these lexical sets to reveal any selectional preferences the word has. An additional contribution provided here is an insight into the limitations of this method. The work presents evidence for the conclusion that aligning selectional preferences with an ontology is useful for some purposes, but fundamentally inaccurate because currently existing ontologies do not accurately reflect the mental categories evoked in selectional preferences.

Sub Specie Aeternitatis »

An essay by the Czech linguist Pavel Eisner which looks at the revival of Czech and at what the future holds for it and for other minority languages.



Localization into Irish BIB

PUBLISHED IN Multilingual Computing and Technology

Ionchódú Téacs ar Ríomhairí BIB




Finding the right structure for lexicographical data: experiences from a terminology project BIB

PUBLISHED IN Proceedings of the 13th Euralex International Congress
PUBLISHER Edizioni dell'Orso, Turin

Uimhreacha na Gaeilge BIB

This work gives a complete account of the rules governing the use of numbers in Irish. As the reader knows, the Irish number system is very complicated, which tempts the writers of grammar books to simplify their accounts of the system and to leave certain questions without a clear answer, because the answer would be complicated and hard to understand. This work takes the opposite approach. I have tried to describe the number system as completely as possible, despite its complexity. This work will serve anyone who is looking for accuracy.



A practical guide for functional text analysis: Analyzing English texts for field, mode, tenor and communicative effectiveness BIB

This document provides a scheme for analyzing English texts from a functional perspective. The document contains information adapted from Chapters 8, 10 and 12 – 16 of Books 2 and 3 of the Open University course E303 English Grammar in Context as it was presented in 2005, as well as from the set book Longman Student Grammar of Spoken and Written English and from the course’s associated readings. Skills in functional analysis are developed in the course books; this document re-iterates in concise form the main points to consider when performing the analysis.



Czech–English translation difficulties arising from differences in word order BIB

This work deals with Czech–English translation difficulties that result from differences in word order between the two languages. A functional framework is used to interpret the implications of the syntactic differences. Both English and Czech have a tendency to present given information at the beginning of a clause and new information at the end, but the flexibility of Czech word order makes it possible to observe this principle more consistently than English syntax allows. Additionally, Czech, unlike English, does not observe the end-weight principle and therefore long stretches of circumstantial information do not tend to be placed at the end of a clause. Both these differences result in significant mismatches in word order between Czech clauses and their English translation equivalents.