How exactly are Kindle dictionaries made?
The format for Kindle dictionaries is somewhat documented by Amazon on this page. By the looks of it, the format has remained fairly consistent across the years and across Kindle versions.
In its simplest form, a Kindle dictionary is a collection of:
- A cover image
- An OPF manifest file, which encodes the dictionary's source and target languages and specifies which index is used to look up words
- One or more HTML files with the body of the dictionary (using a few special tags and parameters to determine the index entries and alternative lookups for inflected forms)
In addition to these, it’s possible to add CSS files for optional styling of the dictionary contents.
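To make the manifest part more concrete, here is a minimal sketch of the dictionary-specific metadata that goes inside the OPF file. The element names follow Amazon's published dictionary guidelines as I understand them; the function name, the "pl"/"en" language codes and the "default" index name are illustrative assumptions, and the rest of the OPF (title, manifest, spine) is left out.

```python
from xml.sax.saxutils import escape


def dictionary_metadata(source_lang, target_lang, index_name="default"):
    """Build the <x-metadata> fragment that marks an OPF file as a dictionary.

    Only the dictionary-specific fragment is produced here; a real OPF file
    also needs the usual title, manifest and spine sections around it.
    """
    return (
        "<x-metadata>\n"
        f"  <DictionaryInLanguage>{escape(source_lang)}</DictionaryInLanguage>\n"
        f"  <DictionaryOutLanguage>{escape(target_lang)}</DictionaryOutLanguage>\n"
        f"  <DefaultLookupIndex>{escape(index_name)}</DefaultLookupIndex>\n"
        "</x-metadata>"
    )


# Illustrative values: a dictionary with Polish headwords and English definitions.
print(dictionary_metadata("pl", "en"))
```

The index name given here should match the name used on the entries in the HTML body, so that lookups are routed to the right index.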
Amazon used to provide a command-line binary utility called kindlegen that would compile and compress the dictionary building blocks (cover image, OPF manifest, HTML content) into a .MOBI file. kindlegen was discontinued sometime in 2020 and replaced by the Kindle Previewer, which has a graphical interface for testing and compiling new ebooks but unfortunately is not supported on Linux machines (which I tend to use whenever I work on a development project). Hence I ferreted out an old copy of kindlegen from a dark corner of the Internet and I’m using that as part of the project.
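For reference, calling kindlegen from Python might look roughly like the sketch below. The file names and function names are hypothetical, and it assumes a kindlegen binary is available on the PATH. The -c2 (strongest compression) and -o (output file name) options are the ones I know from kindlegen's help text; the exit-code handling is an assumption based on the commonly reported behaviour that warnings alone produce exit code 1.

```python
import subprocess
from pathlib import Path


def kindlegen_command(opf_path, output_name, binary="kindlegen"):
    """Return the argument list for compiling an OPF project into a .mobi file."""
    # -c2 asks for the strongest compression; -o sets the output file name.
    return [binary, str(opf_path), "-c2", "-o", output_name]


def compile_dictionary(opf_path, output_name, binary="kindlegen"):
    """Run kindlegen and return the expected location of the compiled file."""
    result = subprocess.run(
        kindlegen_command(opf_path, output_name, binary),
        capture_output=True,
        text=True,
    )
    # Assumption: exit code 1 means "built with warnings", anything higher is an error.
    if result.returncode > 1:
        raise RuntimeError(f"kindlegen failed:\n{result.stdout}")
    # Assumption: the output is written next to the OPF file.
    return Path(opf_path).with_name(output_name)
```

With hypothetical file names, a call such as compile_dictionary("build/skarb.opf", "skarb.mobi") would be expected to leave the compiled file at build/skarb.mobi.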
How is Skarb made?
To generate the corpus I use the following sources:
- A data dump (obtained through the wikiextract Python library) of the English-Polish Wiktionary corpus; as a student of Polish I generally found Wiktionary entries quite good, with lots of useful information (from etymologies to links between perfective and imperfective verb forms)
- The morfeusz2 library, developed as part of the online version of the Grammatical Dictionary of Polish; the library (which comes with Python bindings) is able to analyse and generate inflected or conjugated forms for Polish words and it’s key in improving the dictionary user experience (as it’s able to very quickly generate good-quality data for alternative word lookups)
- The collected lemmas of the Grammatical Dictionary of Polish (scraped via a Python script and translated via the Google Translation API)
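To give an idea of how morfeusz2 can produce the alternative lookups, here is a minimal sketch and not the exact code used in the project. It assumes that the Python bindings expose a Morfeusz class whose generate(lemma) method returns tuples with the inflected form in first position; the helper names are hypothetical. The deduplication step is kept separate so it can be tried without the library installed.

```python
def unique_forms(lemma, analyses):
    """Collect distinct inflected forms from generate()-style tuples.

    Each item in `analyses` is expected to carry the surface form at index 0.
    The lemma itself is skipped, since it is already the headword.
    """
    seen = {lemma}
    forms = []
    for analysis in analyses:
        form = analysis[0]
        if form not in seen:
            seen.add(form)
            forms.append(form)
    return forms


def forms_with_morfeusz(lemma):
    """Generate the inflected forms of `lemma` using morfeusz2 (assumed API)."""
    import morfeusz2  # imported lazily so the helper above works without it

    morf = morfeusz2.Morfeusz()
    return unique_forms(lemma, morf.generate(lemma))


# Hand-written tuples in the assumed (form, lemma, tag, ...) shape:
sample = [
    ("kot", "kot", "subst:sg:nom:m2"),
    ("kota", "kot", "subst:sg:gen:m2"),
    ("kota", "kot", "subst:sg:acc:m2"),
    ("kotu", "kot", "subst:sg:dat:m2"),
]
print(unique_forms("kot", sample))
```

Several grammatical cases often share one surface form (here the genitive and accusative "kota"), so deduplicating keeps the lookup index compact.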
I then use a small Python script to:
- Extract and parse all entries from Wiktionary
- Add all machine-translated entries from SGJP (the Grammatical Dictionary of Polish)
- Generate inflected forms of all entries through morfeusz2
- Generate an HTML file using the entry and inflected forms data
- Generate a dictionary .MOBI file in the correct format through kindlegen
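The HTML generation step can be sketched as follows. This is a simplified illustration and not the project's actual script: the function name and the sample entry are hypothetical, while the idx:entry, idx:orth, idx:infl and idx:iform tags are the special dictionary markup described in Amazon's guidelines. The index name "default" is an assumption and would need to match the lookup index named in the OPF manifest.

```python
from html import escape


def render_entry(headword, definition, inflections=()):
    """Render one dictionary entry using Kindle's idx markup.

    The idx:orth value is the form the reader looks up; each idx:iform adds an
    inflected form that should resolve to the same entry.
    """
    iforms = "".join(
        f'<idx:iform value="{escape(form)}"/>' for form in inflections
    )
    # Only emit the idx:infl wrapper when there is at least one inflected form.
    infl = f"<idx:infl>{iforms}</idx:infl>" if iforms else ""
    return (
        '<idx:entry name="default" scriptable="yes" spell="yes">'
        f'<idx:orth value="{escape(headword)}"><b>{escape(headword)}</b>{infl}</idx:orth>'
        f"<p>{escape(definition)}</p>"
        "</idx:entry>"
    )


# Hypothetical sample entry with two inflected forms.
print(render_entry("kot", "cat", ["kota", "kotu"]))
```

Escaping the headword, forms and definition keeps entries containing quotes or angle brackets from breaking the markup.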