Michal Měchura /ˈmɪxal ˈmɲexura/

Hello! I build linguistic and lexicographic software. I work for Dublin City University where I look after the technology behind the National Terminology Database for Irish, the Placenames Database of Ireland and other online platforms. Also, I work for Foras na Gaeilge as a language technologist on the New English–Irish Dictionary, on Teanglann and on other lexicographic projects. I am the author of the open-source dictionary writing system Lexonomy and the open-source terminology management platform Terminologue.
PhD Computer Science, Masaryk University, 2024, thesis: Data Structures in Lexicography | PgDip Applied Irish, Dublin Institute of Technology, 2010 | MPhil Speech and Language Processing, University of Dublin/Trinity College Dublin, 2008, dissertation: Selectional Preferences, Corpora and Ontologies | BA (Hons), The Open University, 2005
Fairslator (fairslator.com) | Native Dialogs (nativedialogs.com) | Pota Focal (potafocal.com) | An Ríomhaire Ilteangach

Publications & talks

2024

TALK

Zaujatost ve strojovém překladu a co s ní

EVENT Jeronýmovy dny 2024, Praha
TALK

Bias in machine translation: challenges, techniques, perspectives

EVENT AMTA (Association for Machine Translation in the Americas) Conference 2024: Tutorials Day
PH.D. THESIS

Data Structures in Lexicography

INSTITUTION Masaryk University
TALK

A critical look at the data structure behind Logainm.ie

EVENT Thirtieth Spring Conference of the Society for Name Studies in Britain and Ireland, Dublin
BLOG

What lexicographers need to know about DMLex

An unofficial introduction to the Data Model for Lexicography

2023

TALK

Correcting biased translations with the Fairslator API

EVENT Translating and the Computer 45, Luxembourg
TALK

Gender bias in machine translation and what terminologists can do about it

EVENT EAFT Summit, Barcelona
TALK

Lexicography versus XML

EVENT Declarative Amsterdam, Amsterdam
INTERVIEW

Creating an Inclusive AI Future: The Importance of Non-Binary Representation

WHERE machinetranslation.com
I spoke to machinetranslation.com about bias in machine translation, about Fairslator, and about my vision for “human-assisted machine translation”.
BLOG

How I reinvented the wheel and discovered projectional editing

Trust me, you want a code editor that doesn’t let you change the code.

2022

CONFERENCE PAPER

Introducing Fairslator: a machine translation bias removal tool

EVENT Translating and the Computer 44, Luxembourg
PUBLISHED IN Translating and the Computer 44 Proceedings
PUBLISHER Editions Tradulex
ISBN 978-2-9701733-0-4
TALK

Za námi mnoho, před námi ještě víc: digitalizace lexikografie včera, dnes a zítra

EVENT Seminář Ústavu Českého národního korpusu, Praha
TALK

Fairslator Demo

EVENT Text, Speech and Dialogue Conference, Brno
INTERVIEW

When the machine asks the human: in conversation with Michal Měchura

WHERE Goethe-Institut: Artificially Correct
Germany's Goethe-Institut had a few questions to ask me about my Fairslator project.
INTERVIEW

Wenn die Maschine den Menschen fragt: im Gespräch mit Michal Měchura

WHERE Goethe-Institut: Artificially Correct
The Goethe-Institut had a few questions about my Fairslator project.
CONFERENCE PAPER

Document or database? The search for the perfect storage paradigm for lexical data.

EVENT Euralex 2022 Conference, Mannheim, Germany
CONFERENCE PAPER

A taxonomy of bias-causing ambiguities in machine translation

PUBLISHED IN Proceedings of the 4th workshop on gender bias in natural language processing (GeBNLP)
PUBLISHER Association for Computational Linguistics
This paper introduces a taxonomy of phenomena which cause bias in machine translation, covering gender bias (people being male and/or female), number bias (singular you versus plural you) and formality bias (informal you versus formal you). Our taxonomy is a formalism for describing situations in machine translation when the source text leaves some of these properties unspecified (e.g. does not say whether doctor is male or female) but the target language requires the property to be specified (e.g. because it does not have a gender-neutral word for doctor). The formalism described here is used internally by a web-based tool we have built for detecting and correcting bias in the output of any machine translator.
CONFERENCE PAPER with Brian Ó Raghallaigh, Úna Bhreathnach and Gearóid Ó Cleircín

Dare to be different: how user needs determine termbase design

EVENT Multilingual Digital Terminology Today: Design, representation formats and management systems, Padova, Italy
TALK

An introduction to lexicographic data modelling

EVENT Lexicom, Telč, Czech Republic
TALK

DMLex, a data model for lexicography: an example-by-example introduction

EVENT ELEXIS Showcase Event, Florence, Italy
MAGAZINE ARTICLE

What You Need to Know About Bias in Machine Translation

PUBLISHED IN Slator.com
As machine translation gets better, the problem of bias — especially gender bias — remains a source of embarrassment for the industry. Why MT bias matters and how major players are trying to fix it.
TALK

So you want to build a placenames database: an introduction to toponymic data modelling

EVENT Placenames in Bilingual Areas Workshop, Dublin, Ireland
TALK

Ceardlann ar Terminologue

EVENT An Ghaeilge agus an Téarmeolaíocht, Dublin, Ireland
REPORT

We need to talk about bias in machine translation: the Fairslator whitepaper

Machine translation is getting better all the time but the problem of bias still remains. Translations produced by machines are often biased because of ambiguities in gender, in forms of address, and in word meaning. This whitepaper analyzes the problem and proposes a solution based on automated re-inflection with humans in the loop.

2021

TALK with Brian Ó Raghallaigh

Terminologue and open source terminology solutions

EVENT European Association for Terminology Summit 2021, Online
TALK with Brian Ó Raghallaigh

Introducing Terminologue: a cloud-based, open-source terminology management tool

EVENT XIX EURALEX International Congress, Online
TALK

Re‑inventing the phrasebook with rule‑based language technology

EVENT Grammatical Framework Summer School, Singapore and online
An introduction to Czechslator and the technology behind it.
TALK

Lexicographic APIs: the state of the art

EVENT eLex 2021 Conference
JOURNAL ARTICLE with Brian Ó Raghallaigh, Aengus Ó Fionnagáin and Sophie Osborne

Developing the Gaois Linguistic Database of Irish-language Surnames

PUBLISHED IN Names: A Journal of Onomastics
In this paper, we introduce the first-ever open, data-driven linguistic database of Irish-language surnames, along with an algorithm for deriving inflected forms of Irish-language surnames.
BLOG

A survey of dictionary APIs

A survey of application programming interfaces (APIs) on the Internet which provide access to lexicographic content in machine-readable formats.

2019

TALK

The future of dictionary editing

EVENT Lexicom, Mikulov, Moravia

2018

MANUSCRIPT

Plausibility filtering with Grammatical Framework

This document describes a technique called plausibility filtering which you can use to prevent a Grammatical Framework (GF) application grammar from generating semantically implausible sentences.
TALK

Breaking the tyranny of machine translation

EVENT Grammatical Framework Summer School, Stellenbosch, South Africa
CONFERENCE PAPER with Krasimir Angelov

Editing with Search and Exploration for Controlled Languages

PUBLISHED IN Proceedings of the Sixth International Workshop on Controlled Natural Language
PUBLISHER IOS Press
We present an editor for controlled languages which is a combination of a syntax editor and a predictive editor.
CONFERENCE PAPER

Shareable Subentries in Lexonomy as a Solution to the Problem of Multiword Item Placement

EVENT EURALEX 2018, Ljubljana, Slovenia
PUBLISHED IN Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts
This paper introduces a new way of dealing with phraseology in dictionaries. A classical question in lexicography is whether multiword items such as third time lucky should be listed under third, time or lucky. The ideal answer is ‘under all of them’ but, until now, the only way to do that in conventional tree-structured dictionaries has been to keep multiple copies (of what conceptually is one and the same item) in several places throughout the dictionary. We present a way to achieve the same goal without copying. The multiword item becomes a semi-independent subentry which exists in only one copy but appears simultaneously in several places in the dictionary. The structure of the dictionary remains a tree but the lexicographer is empowered to occasionally ‘break out’ of the tree in order to avoid duplication. This paper explains the reasoning behind the concept of shareable subentries and shows how this new functionality has been implemented in the dictionary writing system Lexonomy.
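The idea of a subentry that exists in one copy but appears in several places can be sketched roughly as follows. This is only an illustration under stated assumptions, not how Lexonomy implements it: the class names (`Subentry`, `Entry`, `Dictionary`), the `share` and `render` methods, and the placeholder senses and definitions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subentry:
    """A multiword item that exists in exactly one copy (hypothetical model)."""
    ident: str
    headword: str
    definition: str


@dataclass
class Entry:
    """An ordinary tree-structured entry; subentries are held by reference."""
    headword: str
    senses: List[str] = field(default_factory=list)
    subentry_ids: List[str] = field(default_factory=list)


class Dictionary:
    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}
        self.subentries: Dict[str, Subentry] = {}

    def add_entry(self, entry: Entry) -> None:
        self.entries[entry.headword] = entry

    def share(self, subentry: Subentry, under: List[str]) -> None:
        """Store the subentry once and link it from each listed headword."""
        self.subentries[subentry.ident] = subentry
        for headword in under:
            self.entries[headword].subentry_ids.append(subentry.ident)

    def render(self, headword: str) -> str:
        """Resolve references at display time, so the item appears in place."""
        entry = self.entries[headword]
        lines = [entry.headword]
        lines += [f"  {n}. {sense}" for n, sense in enumerate(entry.senses, 1)]
        for ident in entry.subentry_ids:
            sub = self.subentries[ident]
            lines.append(f"  > {sub.headword}: {sub.definition}")
        return "\n".join(lines)


dictionary = Dictionary()
for word in ("third", "time", "lucky"):
    dictionary.add_entry(Entry(word, senses=[f"(senses of '{word}' go here)"]))

item = Subentry("mwi-1", "third time lucky", "placeholder definition")
dictionary.share(item, under=["third", "time", "lucky"])

# Editing the single copy updates every place where the item appears.
item.definition = "revised placeholder definition"
print(dictionary.render("time"))
```

Each entry stays a tree of its own senses. Only the link to the shared item steps outside the tree, which is the kind of controlled exception the abstract describes.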
TALK with Miloš Jakubíček, Vojtěch Kovář and Pavel Rychlý

Practical Post-Editing Lexicography with Lexonomy and Sketch Engine

EVENT XVIII EURALEX International Congress: Lexicography in Global Contexts

2017

CONFERENCE PAPER

Introducing Lexonomy: an open-source dictionary writing and publishing system

PUBLISHED IN Electronic lexicography in the 21st century: Proceedings of eLex 2017 conference
This demo introduces Lexonomy (www.lexonomy.eu), a free, open-source, web-based dictionary writing and publishing system. In Lexonomy, users can take a dictionary project from initial set-up to final online publication in a completely self-service fashion, with no technical skills required and no financial cost.
TALK

How (not) to build a European Dictionary Portal

EVENT Final Conference of the European Network of e-Lexicography, Leiden
TALK

Ar thairseach na haoise digití: mionteangacha agus an ríomhaireacht

EVENT ‘Ar an Imeall i Lár an Domhain?’: An tairseachúlacht i litríocht agus i gcultúr na hÉireann agus na hEorpa, Prague
TALK

Towards a Metadata Infrastructure for Online Dictionaries

EVENT European Network of e-Lexicography, Budapest
TALK with Miloš Jakubíček, Vojtěch Kovář and Pavel Rychlý

One-Click Dictionary

EVENT Electronic lexicography in the 21st century (eLex) conference
BOOK

An Ríomhaire Ilteangach

PUBLISHER Cois Life
ISBN 978-1-907494-70-3
Treoirleabhar don teicneolaíocht teanga atá dírithe ar an léitheoir ginearálta. Léitheoireacht riachtanach é seo do gach duine a láimhseálann breis is teanga amháin ar an ríomhaire. | A guide to language technology for general readers. This book is required reading for everybody who uses more than one language on their computer.

2016

TALK with Brian Ó Raghallaigh and Katie Ní Loingsigh

Towards a database of Irish surnames

EVENT 25th Spring Conference of the Society for Name Studies in Britain and Ireland
TALK

Things to think about when building a dictionary website

EVENT European Network of e-Lexicography, Barcelona, Catalonia
CONFERENCE PAPER

Data Structures in Lexicography: from Trees to Graphs

PUBLISHED IN Recent Advances in Slavonic Natural Language Processing
In lexicography, a dictionary entry is typically encoded in XML as a tree: a hierarchical data structure of parent-child relations where every element has at most one parent. This choice of data structure makes some aspects of the lexicographer’s work unnecessarily difficult, from deciding where to place multi-word items to reversing an entire bilingual dictionary. This paper proposes that these and other notorious areas of difficulty can be made easier by remodelling dictionaries as graphs rather than trees. However, unlike other authors who have proposed a radical departure from tree structures and whose proposals have remained largely unimplemented, this paper proposes a conservative compromise in which existing tree structures become augmented with specific types of inter-entry relations designed to solve specific problems.

2015

TALK

Do minority languages need the same language technology as majority languages?

EVENT British-Irish Council conference on language technology in indigenous, minority and lesser-used languages, Dublin Castle, Ireland
BLOG

Do minority languages need machine translation?

I want to bust the myth that machine translation is necessary for the revival of minority languages.

2014

CONFERENCE PAPER

Irish National Morphology Database: a high-accuracy open-source dataset of Irish words

PUBLISHED IN Proceedings of the First Celtic Language Technology Workshop
The Irish National Morphology Database is a human-verified, Official Standard-compliant dataset containing the inflected forms and other morphosyntactic properties of Irish nouns, adjectives, verbs and prepositions. It is being developed by Foras na Gaeilge as part of the New English-Irish Dictionary project. This paper introduces this dataset and its accompanying software library Gramadán.
BLOG

10 reasons why Irish is an absolutely awesome language

And these are proper linguistic reasons, too – none of that starry-eyed sentimental nonsense about the language being ‘beautiful’ or ‘romantic’.

Breathing new life into old data: how to retro-digitize a dictionary

What I learned from a project where we retro-digitized two Irish dictionaries and published them on the web.

2013

BLOG

The linguistic relativity of up and down

A nice and simple example of how learning a new language causes you to start perceiving the world differently.

2012

CONFERENCE PAPER

Léacslann: a platform for building dictionary writing systems

PUBLISHED IN Proceedings of the 15th Euralex International Congress
PUBLISHER University of Oslo
The purpose of this demo is to introduce Léacslann, a new platform for building dictionary writing systems (DWS) and terminology management systems (TMS) as well as other lexicographic and reference applications. Léacslann can be used without any knowledge of programming to create a basic lexical database with an arbitrary structure. This will be demonstrated in the first half of the demo, while the second half will show how a software developer can customize Léacslann for more demanding applications.
TALK with Brian Ó Raghallaigh

The logainm.ie Placenames Database of Ireland: Software demonstration

EVENT Placenames Workshop 2012
TALK

Idir foclóir agus léarscáil: Bunachar Logainmneacha na hÉireann

EVENT Daonscoil na Mumhan, Waterford, Ireland
REPORT

Léacslann Tutorial

PUBLISHER Dublin City University

2010

CONFERENCE PAPER

When definitions are not enough

PUBLISHED IN Proceedings of Terminology and Knowledge Engineering (TKE) Conference
PUBLISHER Dublin City University
This paper introduces Compositional Term Diagrams (CTDs) as a formalism for analysing the structure of multi-word terms. CTDs have the potential to help terminologists resolve ambiguities related to transitivity (“who does what to whom”), modification (“what modifies what”) and evocation (“which sense is evoked by this word?”).
TALK with Brian Ó Raghallaigh

How to build a termbase for 500,000 users (and live to tell the story)

EVENT Terminology and Knowledge Engineering (TKE) Conference, Dublin, Ireland
CONFERENCE PAPER

What WordNet does not know about selectional preferences

PUBLISHED IN Proceedings of the 14th Euralex International Congress
PUBLISHER Fryske Akademy
Selectional preferences are the tendencies of words to co-occur with other words that belong to certain semantic types. In this paper, I will investigate how closely these corpus-attested preferences correspond to WordNet. For example, for all possible direct objects of cancel, is there a single category (or a union of several categories) in WordNet that subsumes them, and only them? Selectional preferences manifest themselves in authentic texts and can be revealed through corpus analysis. I will introduce an experimental tool I have built which attempts to do this automatically by aligning corpus-extracted lists of collocates (for example a list of the direct objects of cancel) with WordNet. The strength of this method is that it can discover and name selectional preferences automatically, but its weakness is that it can only do so when WordNet contains a suitable category. We will see that WordNet often lacks a category (or even a union of several categories) that fully corresponds to an attested selectional preference – for example, there is no category in WordNet that includes all the kinds of events that can be direct objects of cancel (meeting, wedding, concert etc.) but excludes those that cannot (accident, sunset, invention etc.).
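The alignment problem described in this abstract can be illustrated with a small sketch. It is not the experimental tool from the paper, and it does not use real WordNet data: the `HYPERNYM` table is a made-up toy taxonomy, and the function names are hypothetical. It finds the most specific category covering all attested objects of cancel, then lists the unattested words that the same category would wrongly admit.

```python
from typing import Dict, List, Optional

# Toy taxonomy (child -> parent). Invented for illustration, NOT WordNet.
HYPERNYM: Dict[str, str] = {
    "meeting": "gathering",
    "wedding": "ceremony",
    "concert": "performance",
    "accident": "mishap",
    "sunset": "natural_phenomenon",
    "invention": "creation",
    "gathering": "event",
    "ceremony": "event",
    "performance": "event",
    "mishap": "event",
    "natural_phenomenon": "event",
    "creation": "event",
    "event": "entity",
}


def ancestors(word: str) -> List[str]:
    """Return the word followed by its chain of hypernyms, nearest first."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def lowest_common_category(words: List[str]) -> Optional[str]:
    """Most specific category that subsumes every word, or None."""
    if not words:
        return None
    common = set(ancestors(words[0]))
    for word in words[1:]:
        common &= set(ancestors(word))
    for category in ancestors(words[0]):  # nearest first
        if category in common:
            return category
    return None


def overgenerated(category: str, non_collocates: List[str]) -> List[str]:
    """Words the category admits although the corpus does not attest them."""
    return [w for w in non_collocates if category in ancestors(w)]


attested = ["meeting", "wedding", "concert"]      # can be cancelled
unattested = ["accident", "sunset", "invention"]  # cannot be cancelled

category = lowest_common_category(attested)
print(category, overgenerated(category, unattested))
```

In this toy taxonomy the only shared category is the very general `event`, which also covers all three unattested words. That mirrors the weakness the abstract reports: the method can name a preference only when the ontology happens to contain a category of the right size.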
TALK with Brian Ó Raghallaigh

The Focal.ie National Terminology Database for Irish

EVENT 14th Euralex International Congress, Ljouwert/Leeuwarden
BLOG

Living with a diacritic

No, this is not an article about living with an obscure illness. It’s an article about living with a name no-one can spell correctly.

2009

TALK with Brian Ó Raghallaigh

User-Friendliness: the key to promoting a minority language on the Internet

EVENT 12th International Conference on Minority Languages, Tartu, Estonia
BLOG

Flags as language symbols – so what is the problem?

Using country flags as if they were language symbols is bad. So why does everybody keep on doing it? And is it really so bad?

Linguistic relativity: fact or wishful thinking?

Most linguists secretly wish the Sapir-Whorf Hypothesis to be true. But is it?

2008

CONFERENCE PAPER

Giving them what they want: search strategies for electronic dictionaries

PUBLISHED IN Proceedings of the 13th Euralex International Congress
PUBLISHER Universitat Pompeu Fabra
This paper deals with how humans search electronic dictionaries. It raises the point that users often make dictionary searches with misspellings, with inflected words copied and pasted from elsewhere, with complete sentences or fragments thereof, and with other kinds of low-quality input, and suggests methods for dealing with such phenomena in a pre-emptive manner. The issues addressed include searching with inflections, dealing with multi-word items, misspelling detection and text normalization. Additionally, the value of log files is emphasized as a source of information on user behaviour.
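A search routine that tolerates the kinds of input listed in this abstract might look roughly like the sketch below. It is an assumption-laden illustration, not the paper's method: the `HEADWORDS` set and the `INFLECTIONS` table are toy data, the function names are hypothetical, and misspelling detection is approximated with the standard library's `difflib.get_close_matches`.

```python
import difflib
import unicodedata
from typing import List

# Toy data for illustration only.
HEADWORDS = {"cat", "go", "ice cream", "mouse"}
INFLECTIONS = {"cats": "cat", "went": "go", "mice": "mouse"}


def normalize(query: str) -> str:
    """Text normalization: Unicode NFC, case folding, punctuation and spacing."""
    text = unicodedata.normalize("NFC", query).casefold()
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(text.split())


def contains(sequence: List[str], part: List[str]) -> bool:
    """True if `part` occurs as a contiguous run inside `sequence`."""
    n = len(part)
    return any(sequence[i:i + n] == part for i in range(len(sequence) - n + 1))


def search(query: str) -> List[str]:
    q = normalize(query)
    if q in HEADWORDS:                      # 1. exact match
        return [q]
    if q in INFLECTIONS:                    # 2. inflected form
        return [INFLECTIONS[q]]
    # 3. sentence or fragment: look for headwords among the lemmatized tokens
    lemmas = [INFLECTIONS.get(token, token) for token in q.split()]
    hits = [h for h in sorted(HEADWORDS) if contains(lemmas, h.split())]
    if hits:
        return hits
    # 4. probable misspelling: fall back to fuzzy matching
    return difflib.get_close_matches(q, sorted(HEADWORDS), n=3, cutoff=0.75)


for example in ["Cats", "  Mice! ", "The cats went home.", "mouze"]:
    print(repr(example), "->", search(example))
```

The order of the steps is a design choice: cheap, precise strategies run first and the fuzzy fallback runs last, so that a correctly spelled query is never overridden by a near match.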
TALK

Cá bhfuil mo shínte fada? – ionchódú téacs ar ríomhairí

EVENT Engineers Ireland, Dublin, Ireland
M.PHIL. DISSERTATION

Selectional Preferences, Corpora and Ontologies

INSTITUTION Trinity College, University of Dublin
This work presents a technique for exploring the selectional preferences of words in a semi-automatic way. The technique combines corpora with ontologies such as WordNet. The term selectional preference denotes a word’s tendency to co-occur with words that belong to certain lexical sets. For example, the adjective delicious prefers to modify nouns that denote food and the verb marry prefers subjects and objects that denote humans. This work develops techniques for associating corpus-attested selectional preferences with concepts in an ontology. It shows how lexical sets can be derived from ontologies and how corpus-extracted collocates of a word can then be aligned with these lexical sets to reveal any selectional preferences the word has. An additional contribution provided here is an insight into the limitations of this method. The work presents evidence for the conclusion that aligning selectional preferences with an ontology is useful for some purposes, but fundamentally inaccurate because currently existing ontologies do not accurately reflect the mental categories evoked in selectional preferences.
BLOG

Sub Specie Aeternitatis

An essay by the Czech linguist Pavel Eisner which looks at the revival of Czech and at what the future holds for it and for other minority languages.

2007

MAGAZINE ARTICLE

Localization into Irish

PUBLISHED IN Multilingual Computing and Technology
MAGAZINE ARTICLE

Ionchódú Téacs ar Ríomhairí

PUBLISHED IN Comhar

2006

CONFERENCE PAPER

Finding the right structure for lexicographical data: experiences from a terminology project

PUBLISHED IN Proceedings of the 12th Euralex International Congress
PUBLISHER Edizioni dell'Orso
MANUSCRIPT

Uimhreacha na Gaeilge

This work gives a complete account of the rules governing the use of numbers in Irish. As the reader knows, the Irish number system is very complicated, which tempts authors of grammar books to simplify their accounts of the system and to leave certain questions without a clear answer because the answer would be complicated and difficult to understand. This work takes the opposite approach. I have tried to describe the number system as completely as possible, in spite of its complexity. The work will serve anyone in search of accuracy.

2005

MANUSCRIPT

A practical guide for functional text analysis: Analyzing English texts for field, mode, tenor and communicative effectiveness

This document provides a scheme for analyzing English texts from a functional perspective. The document contains information adapted from Chapters 8, 10 and 12–16 of Books 2 and 3 of the Open University course E303 English Grammar in Context as it was presented in 2005, as well as from the set book Longman Student Grammar of Spoken and Written English and from the course’s associated readings. Skills in functional analysis are developed in the course books; this document re-iterates in concise form the main points to consider when performing the analysis.

2004

MANUSCRIPT

Czech–English translation difficulties arising from differences in word order

This work deals with Czech-English translation difficulties that result from differences in word order between the two languages. A functional framework is used to interpret the implications of the syntactic differences. Both English and Czech tend to present given information at the beginning of a clause and new information at the end, but the flexibility of Czech word order makes it possible to observe this principle more consistently than English syntax allows. Additionally, Czech, unlike English, does not observe the end-weight principle, so long stretches of circumstantial information are not necessarily placed at the end of a clause. Both these differences result in significant mismatches in word order between Czech clauses and their English translation equivalents.