Dictionaries have recently migrated from paper to screens. Let’s take stock of where that leaves us.
This article follows on from a previous one on human-oriented lexicography where I explained what I think dictionaries are, how they differ from NLP-style “language resources”, and what types of content they contain: entries, headwords, senses and so on. In this article, I will review to what extent lexicography has become digitised in the last few decades.
For most of their history, dictionaries have existed as printed books. Today, however, the popular image of “the dictionary” as a book is outdated and hugely out of sync with how lexicography is actually done. Today’s lexicography is a discipline where everything happens on computers, either fully automatically or in interaction with humans: this applies to how dictionaries are made (using corpus query software and dictionary writing systems) as well as to how dictionaries are delivered to end-users (as websites and mobile apps). The printed dictionary market has shrunk to a shadow of its former self while online dictionaries rule the day. Most dictionary projects today are designed as digital-only, with no printed output planned. Like many other disciplines, lexicography is going through a digital transformation. The purpose of this article is to clarify how far advanced we are in this transformation and how much of it is still ahead of us.
The process of making and delivering a dictionary is something which unfolds in stages. The first stage is when we are discovering facts about words, these days typically from a corpus. The second stage is when we are organising these facts into the form of dictionary entries. The final stage is when we are delivering dictionaries to human users on their screens or (rarely) on printed pages. I will argue that not all stages have been digitised equally. Although the first stage – discovery – has been digitised thoroughly and in some sense “completely”, the remaining two stages – organisation and delivery – have only been digitised rather superficially so far and there is untapped potential in them yet.
To say something about a word, the lexicographer must know something about it first. Pre-digital lexicographers relied on introspection and their own subjective judgment to produce lists of the meanings a word has, to compose example sentences, and so on. From the 19th century onwards this started becoming more objective and empirical with the introduction of citation slips and various reading programmes. And, from the late 20th century onwards, these analog tools began to be replaced by methods from corpus linguistics and natural language processing.
The use of corpuses and computational methods for lexicography was pioneered in the 1980s by the now legendary COBUILD project. Today, putting NLP at the service of lexicography – for the purposes of knowledge acquisition – is a well-established research programme. Computational methods have given lexicographers previously unheard-of superpowers such as automatic discovery of collocations based on various statistical measures, automatic word-sense discovery through clustering of collocates, automatic discovery of synonyms, antonyms and other semantic or paradigmatic relations, and even finding “good” dictionary examples based on heuristics such as “prefer short sentences with simple words in them”. Corpus-based lexicography is now the standard: practically all dictionary projects begin by deciding which corpus to work from, and the process of compiling a dictionary entry almost always begins with using a corpus query system such as Sketch Engine to discover facts about the headword.
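To make the collocation-discovery step concrete, here is a minimal sketch, assuming nothing about any particular corpus tool, of scoring collocation candidates with one simple association measure, pointwise mutual information; the frequency counts are invented for illustration, and real systems such as Sketch Engine use more robust measures like logDice.

```python
import math

def pmi(cooc: int, f_node: int, f_coll: int, corpus_size: int) -> float:
    """Pointwise mutual information: how much more often the node word and
    a candidate collocate co-occur than chance alone would predict."""
    p_xy = cooc / corpus_size
    p_x = f_node / corpus_size
    p_y = f_coll / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# Invented counts for the node word "coffee" in a 10-million-token corpus.
corpus_size = 10_000_000
f_node = 4_200                     # frequency of "coffee"
candidates = {                     # collocate -> (co-occurrence count, collocate frequency)
    "strong": (310, 18_000),
    "cup":    (520, 9_500),
    "the":    (3_900, 600_000),
}

scores = {w: pmi(c, f_node, f, corpus_size) for w, (c, f) in candidates.items()}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} PMI = {score:.2f}")
# "cup" and "strong" rise to the top; the function word "the" scores low
# despite co-occurring often, which is exactly what the lexicographer wants.
```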
The corpus turn in lexicography has introduced three major categories of innovations. Firstly, computers enabled the existence of superhumanly large corpuses which would have been unachievable with analog tools: in other disciplines such large datasets are called Big Data. Secondly, they brought statistical methods that can analyze these corpuses more objectively than a human lexicographer could, and at the same time bring to light knowledge that a human reader might not even notice. Thirdly, new models of human-computer interaction have emerged – concepts such as keyword in context and word sketch – which allow the human lexicographer to inspect the outputs of these corpus methods and make sense of them.
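As an illustration of the interaction side, a keyword-in-context display is conceptually very simple; the sketch below treats the corpus as a plain list of tokens and is only meant to show the idea, not how production concordancers are built.

```python
def kwic(tokens: list[str], keyword: str, window: int = 4) -> list[str]:
    """Return keyword-in-context lines: each hit centred,
    with up to `window` tokens of left and right context."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>30}  [{token}]  {right}")
    return lines

text = "the cat sat on the mat while the dog watched the cat from the window"
print("\n".join(kwic(text.split(), "cat")))
```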
We can therefore say that the initial stage of the entire lexicographic process – knowledge acquisition – has already been digitised so thoroughly and so deeply that we have actually redefined it into something quite different from what it was in pre-digital times. Today’s corpus tools are not just better versions of paper-based citation slips and reading programmes: they are qualitatively different, delivering results that would have been unachievable without them. The NLP methods used for extracting knowledge from corpuses will certainly continue to improve incrementally, but it seems that no major new innovations are likely to emerge in this area: the potential offered by the digital medium has been exploited more or less fully here.
As a parallel trend, we sometimes see efforts to complement corpus data with insights from research on how people use online dictionaries, especially search log analysis, and with insights from studies in psycholinguistics and language acquisition such as word prevalence and age of acquisition, or learner level. These play a role mainly in deciding which headwords to include in a dictionary and in prioritizing which headwords the lexicographers should process first. For everything else, “corpus is king” and will probably remain on its throne for the foreseeable future.
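How such signals might feed into headword prioritisation can be sketched as a simple weighted score; the field names and weights below are invented for illustration, and any real project would tune them to its own data.

```python
import math
from dataclasses import dataclass

@dataclass
class HeadwordCandidate:
    lemma: str
    corpus_freq: float   # occurrences per million tokens in the corpus
    lookups: int         # how often users searched for it (from the query logs)
    prevalence: float    # share of speakers who report knowing the word, 0.0-1.0

def priority(c: HeadwordCandidate,
             w_freq: float = 0.5, w_look: float = 0.3, w_prev: float = 0.2) -> float:
    """Weighted score for deciding which headwords to process first.
    The weights are arbitrary and would be tuned per project."""
    return (w_freq * math.log1p(c.corpus_freq)
            + w_look * math.log1p(c.lookups)
            + w_prev * c.prevalence)

candidates = [
    HeadwordCandidate("window", 310.0, 1200, 0.99),
    HeadwordCandidate("mullion", 2.1, 45, 0.35),
]
candidates.sort(key=priority, reverse=True)
print([c.lemma for c in candidates])   # "window" comes first
```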
The lexicographer’s job is to “translate” knowledge from the corpus into the form of a dictionary entry that is going to be comprehensible and useful to the intended end-users. Until recently, the everyday reality for working lexicographers has been to do the “translating” manually: on one screen, the lexicographer watches the results of the corpus analysis (using a corpus query system such as Sketch Engine) and, on another screen, he or she compiles the dictionary entry by typing, copying and pasting short pieces of text into a prepared structure in a dictionary writing system such as Lexonomy. All knowledge from the corpus passes through the mind and fingers of the person in front of the keyboard before it becomes a dictionary. Human minds and fingers are the bottlenecks of the lexicographic process – they are what makes dictionary projects take so long and cost so much money – so it is no wonder that there is a push towards automation here.
Automation in this area began modestly over a decade ago in the form of ergonomic improvements such as tickbox lexicography in Sketch Engine (which makes it possible to batch-copy content from the corpus tool into the dictionary) and content pulling in Lexonomy (which allows the lexicographer to “pull” content from the corpus into the dictionary entry on request).
More recently, we have seen more radical attempts at automation, with people experimenting with the automatic generation of entire dictionary entries and even entire dictionaries, either “at once” (the One-Click Dictionary method in Sketch Engine, which makes it possible to generate an entire proto-dictionary from the corpus) or “gradually”, where entries are generated step by step in interaction with a human editor (the Million-Click Dictionary method). Experience so far shows that it is possible to make the lexicographic process faster and cheaper this way and, importantly, without having to accept inconvenient trade-offs in the quality of the resulting dictionary.
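In principle, “one-click” generation amounts to pouring corpus-derived material into an entry skeleton with no human intervention; the sketch below is illustrative only and does not reproduce the actual Sketch Engine output format.

```python
def build_proto_entry(headword: str, pos: str,
                      collocations: list[str], examples: list[str],
                      max_examples: int = 2) -> dict:
    """Assemble a flat proto-entry from corpus-derived material.
    Everything in it still needs post-editing by a human lexicographer."""
    return {
        "headword": headword,
        "pos": pos,
        "senses": [{                        # a single flat sense, no subsenses
            "definition": "",               # left empty: definitions resist corpus extraction
            "collocations": collocations[:5],
            "examples": examples[:max_examples],
        }],
    }

entry = build_proto_entry(
    "window", "noun",
    collocations=["open a window", "window frame", "shop window"],
    examples=["She looked out of the window.", "The window was wide open."],
)
print(entry["senses"][0]["collocations"])
```

The single flat sense and the empty definition slot are not accidental: they reflect what can and cannot be pulled out of a corpus automatically, a point taken up below.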
However, nothing is without consequences. Here, like everywhere else, increased automation forces a certain redefinition of what we are actually doing. Firstly, the lexicographer’s role is shifting from that of a “driver” of the entire process to that of a post-editor: someone who only corrects the computer’s mistakes and intervenes where the computer does not know how to proceed. This transformation, which is only just beginning in lexicography, is already far advanced in other language-related disciplines, for example in translation (where the translator is becoming a machine-translation post-editor) and in copywriting (where the people who produce marketing texts, such as product descriptions for online shopping, are becoming authors of templates from which machines then generate finished texts). Secondly, there is a tendency to simplify the structure of dictionary entries. Entries generated by automatic methods tend to have a flatter structure (without a complex hierarchy of senses and subsenses) and contain a narrower repertoire of content types than dictionaries compiled by humans. In other words, automatically generated dictionaries are shaped by what can be obtained from the corpus rather than by what lexicographers would ideally want to have in a dictionary. Whether this bothers the end-users, and whether they even notice, is an open question.
All these trends are relatively new and far from established practice yet. Some dictionary makers are experimenting with them while others are not even aware of them yet. An additional, more recently emerged trend, which is even further from everyday practice, is the use of generative AI to automate certain lexicographic tasks, especially those that have previously resisted automation, such as definition writing. Lexicographic definitions are notoriously difficult to extract from corpuses because authentic non-dictionary texts do not normally contain sentences of that type (“a window is a space usually filled with glass in the wall of a building”). It seems that, with clever prompting and clever use of few-shot learning techniques, large language models such as ChatGPT are able to generate dictionary-style definitions to a standard comparable with those written by human lexicographers.
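As a minimal sketch of the few-shot idea: a couple of human-written definitions are shown to the model before the target word is presented. The exemplar definitions, the wording of the prompt and the model name are assumptions, and the OpenAI client is used only as one possible backend.

```python
# The model name and the OpenAI client are assumptions; any chat-style
# LLM endpoint and any set of exemplar definitions would do.
from openai import OpenAI

FEW_SHOT = [
    ("window", "a space, usually filled with glass, in the wall of a building, that lets in light"),
    ("ladder", "a piece of equipment with steps between two long side pieces, used for climbing up and down"),
]

def definition_messages(headword: str) -> list[dict]:
    """Build a few-shot prompt: exemplar definitions first, target word last."""
    messages = [{"role": "system",
                 "content": "You are a lexicographer. Write short, learner-friendly definitions."}]
    for word, definition in FEW_SHOT:
        messages.append({"role": "user", "content": f"Define the noun '{word}'."})
        messages.append({"role": "assistant", "content": definition})
    messages.append({"role": "user", "content": f"Define the noun '{headword}'."})
    return messages

client = OpenAI()  # assumes an API key in the environment
response = client.chat.completions.create(model="gpt-4o-mini",
                                           messages=definition_messages("chimney"))
print(response.choices[0].message.content)
```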
The summary is that this particular stage of the lexicographic process – the stage when lexicological knowledge is converted into lexicographic content – is currently undergoing rapid innovation and is subject to a strong push towards automation. This will probably force a redefinition and renegotiation of the roles of humans and machines in the entire process.
The classical data structure for lexicographic content is the entry. Lexicographers typically use specialised software, a dictionary writing system, for editing a dictionary. Widely used dictionary writing systems today include the IDM Dictionary Production System, TLex, iLex and Lexonomy.
Using specialised dictionary software is commonplace on dictionary projects today. At first glance, it might seem that this is another example of deep and thorough digitisation: nobody seriously considers writing a dictionary in an ordinary word processor any more. But, as I have argued in my PhD thesis, we have not yet exhausted all the potential that the digital medium offers. Practically all current dictionary writing systems represent dictionary entries as isolated tree structures, usually encoded in XML – the software is not much more than a glorified XML editor. My entire thesis is basically a critique of this position: I argue there that keeping dictionaries in a purely tree-structured data model imposes certain inconvenient limits and causes problems which could be solved by re-engineering dictionaries into a more flexible, partially graph-based data model.
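The difference between the two data models can be sketched like this; the structures are illustrative only and do not reproduce any particular system’s schema.

```python
# Tree model: each entry is a self-contained hierarchy, as in typical dictionary XML.
tree_entry = {
    "headword": "window",
    "senses": [
        {"definition": "an opening in a wall, usually filled with glass",
         "subsenses": [{"definition": "the pane of glass filling such an opening"}]},
    ],
}

# Graph model: entries and senses are nodes with identifiers, and relations
# (including relations across entries) are explicit edges between nodes.
nodes = {
    "entry:window":     {"type": "entry", "headword": "window"},
    "sense:window-1":   {"type": "sense", "definition": "an opening in a wall, usually filled with glass"},
    "entry:casement":   {"type": "entry", "headword": "casement"},
    "sense:casement-1": {"type": "sense", "definition": "a window that opens on hinges"},
}
edges = [
    ("entry:window",     "has_sense",  "sense:window-1"),
    ("entry:casement",   "has_sense",  "sense:casement-1"),
    ("sense:casement-1", "hyponym_of", "sense:window-1"),  # a cross-entry link an isolated tree cannot express
]
```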
Explaining the details of this would be enough material for a separate article. Let it just be said at this point that, from a data-modelling perspective, lexicography has only undergone a rather shallow form of digitisation so far, and that there is much to be done yet.
When dictionaries migrated from the pages of books onto computer and phone screens in the last two decades, it was a big change for the better for the end user. The main improvement is that it made searching faster and easier. In paper dictionaries, the user had no choice but to be his or her own search engine: people searched by turning the pages with their fingers, navigating alphabetically. This is a process which takes time and puts a cognitive load on the person: your attention is distracted from whatever you were doing before, such as reading or writing, by having to search the dictionary. Computers have allowed us to take this cognitive burden off ourselves and outsource it to a machine. This has made it much easier for people to use dictionaries. (Anecdotal evidence suggests – even if there is no data to prove it – that people consult dictionaries more often today, in the digital era, than they used to back when all dictionaries were on paper. This must be because dictionaries are easier to use now: human nature dictates that the easier something is, the more likely people are to do it.)
Some people are still deeply fascinated by this innovation. But for most computer users today, many of whom are digital natives, this evolutionary step is something that happened a long time ago. Digital dictionaries are the normal state of affairs. It is now time to start asking what the next evolutionary step in human-dictionary interaction will be.
One emerging user requirement is aggregation: people increasingly express a desire to search many dictionaries at once. The current situation is that, in each language and in each language pair, users usually have a choice of multiple online dictionaries and dictionary-like products which may or may not satisfy their current information need, and the user has to visit each website individually to see whether it has the information he or she is looking for. This can be an arduous slog around the Internet: every dictionary website is a little different, some are user-friendly and ergonomic, others less so; you have to know them, know their addresses, know the strengths and weaknesses of their search algorithms. It is a large cognitive load. Can it be automated?
One strategy is to use a generic search engine like Google – but generic search engines often mistake a lexicographic query for an encyclopedic one: “tell me about the word cat” versus “tell me about cats”. Another option is to use one of the few existing dictionary-specific meta-search engines and aggregators, such as OneLook and the European Dictionary Portal – but their problem is often that they do not cover the languages, language pairs or individual dictionaries the user wants.
The road to better aggregation of dictionary websites is currently blocked by several obstacles. One obstacle is the absence of widely respected standards for exposing dictionary metadata on the Internet: a machine-readable vocabulary which any dictionary website could use to tell the world which headwords it contains, in which languages they are, and so on. This is a technical hurdle. The second obstacle is more human: publishers tend to be reluctant to make their content available to third parties. Most organisations that publish online dictionaries today prefer to do so on their own websites, under their own logos, with their own identities. This is understandable for commercial publishers, but non-commercial and academic institutions have this tendency too. In spite of these handicaps, some form of aggregation on tomorrow’s “Internet of dictionaries” is probably unavoidable. For one thing, it is what users want; for another, it is already happening in other information disciplines, mainly in libraries and in scientific publishing: open metadata, portals of all kinds and metasearch engines are already commonplace there today.
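No such standard exists yet, but a hedged sketch of what a machine-readable dictionary descriptor could look like follows below; every field name and URL is invented purely for illustration.

```python
# A hypothetical machine-readable "dictionary descriptor" which a site could
# publish at a well-known URL for aggregators to crawl. Every field name and
# URL here is invented; no such standard is widely adopted yet.
import json

descriptor = {
    "title": "Example Learner's Dictionary of English",
    "publisher": "Example Press",
    "sourceLanguage": "en",
    "targetLanguages": ["en"],
    "headwordCount": 52000,
    "licence": "proprietary",
    "lookupUrlTemplate": "https://dictionary.example.com/entry/{headword}",
    "headwordListUrl": "https://dictionary.example.com/headwords.txt",
}
print(json.dumps(descriptor, indent=2))
```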
A second emerging trend is for dictionaries to become integrated into other tools, even to such an extent that the dictionary becomes invisible. The motivation is again to minimise the cognitive load associated with consulting a dictionary. A dictionary is something people use while they are doing something else, typically reading or writing. While reading or writing, an information need may emerge in the reader’s or writer’s mind, a need which must be satisfied before the user can or wants to continue: perhaps because he or she does not understand a phrase or is not sure how best to express an idea. This is when people decide to go to a dictionary, but going there comes at the cost of becoming distracted and perhaps losing track of what they were doing before.
This is why we are beginning to see experiments with digital tools which eliminate the need to “go” anywhere at all: the user can satisfy his or her information needs right there in the current context, without having to – for example – switch to a different browser tab or window. An example is the experimental tool ColloCaid, which suggests typical collocations on the spot while the user is writing in a second language, without the need to go anywhere or search for anything. Writing tools are not a new genre by any means (spellcheckers and grammar checkers have existed for decades); what is new is the fusion between them and subsets of what would traditionally be called “lexicography” (in ColloCaid’s case, the cataloguing of collocations). While writing tools are a well-known genre, “reading tools” are not, and that is perhaps why nobody has built a hypothetical “clicktionary” yet: a tool which would let the user click on any word anywhere and which would not only bring up the correct dictionary entry but would also highlight the correct sense inside the entry.
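The sense-highlighting half of such a tool could, at its simplest, work by comparing the sentence the user clicked in with each sense’s definition; the sketch below uses a simplified Lesk-style word overlap, and the toy entry is invented for illustration.

```python
def pick_sense(context: str, senses: list[dict]) -> dict:
    """Simplified Lesk: choose the sense whose definition shares
    the most words with the sentence the user clicked in."""
    context_words = set(context.lower().split())
    def overlap(sense: dict) -> int:
        return len(context_words & set(sense["definition"].lower().split()))
    return max(senses, key=overlap)

bank_senses = [
    {"id": 1, "definition": "a financial institution that accepts deposits and lends money"},
    {"id": 2, "definition": "the land alongside a river or lake"},
]
clicked_sentence = "They moored the boat on the bank of the river"
print(pick_sense(clicked_sentence, bank_senses)["id"])   # -> 2, because "river" overlaps
```

A real clicktionary would of course need much better disambiguation than this, but the sketch shows why the problem is tractable in principle.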
Both these trends – the trend towards aggregation and the trend towards invisibility – are still at an extremely early stage. The current state of the art is more mundane and prosaic: we have dictionary websites and dictionary apps which, while offering a somewhat better user experience than printed dictionaries, are not really bringing anything qualitatively new, anything the printed dictionaries were not doing already. Interaction between humans and dictionaries is therefore an area in a relatively shallow state of digitisation, an area where radical innovation is still waiting to happen.
This article has analysed the process of making and delivering a dictionary as something that unfolds in stages, and has shown how the different stages have become digitised to different depths. The initial stage of the process – knowledge acquisition – is now so deeply digitised that hardly any qualitatively new developments are expected any more, while the other stages – from what happens in dictionary writing systems to what eventually lands on an end-user’s screen – are still in a shallow state of digitisation, with potential for qualitative jumps to completely new levels.
This is normal for any industry which is undergoing a digital transformation. In one influential book on digital transformation in business, the authors distinguish between two stages of digitisation: early digitisation is when existing business models and processes are merely improved by becoming digital, while the later stage, when the business truly becomes a “digital business”, is when the digital infrastructure enables the discovery of completely new, digital-only models and offerings which never existed before. This corresponds to the distinction between shallow and deep digitisation in this article.
The fact that shallow comes before deep appears to be a feature of technological progress generally. Steven Pemberton makes a similar observation on innovations that happened long ago: “Whenever a new technology is introduced, it imitates the old. Early cars looked like horseless carriages because that is exactly what they were. [...] It took a long time for cars to evolve into what we now know.” And: “For the first 50 years, [printed] books looked just like manuscripts: hand-writing fonts, no page numbers, no table of contents, or index. Why? That was what was expected of a book at the time. [...] After about 50 years, readable fonts were introduced.”
When a new technology imitates the old and only improves it a little, that is a shallow form of innovation. That’s where we still are in a lot of lexicography today.