19th Century Romanian Transitional Alphabet Transliteration Project
Early 19th-century Romania saw the appearance and evolution of several transitional scripts that mixed Cyrillic and Latin characters, as various scholars worked to migrate all texts to the Latin script by the second part of the century. As a result, many texts from that period were written in these scripts. While many of the regular Cyrillic characters appear in the Romanian version, some are unique to it. Today, these texts survive as original manuscripts or scanned documents in libraries across Romania. Studying them is difficult, not only because there were many versions of the transitional script, but also because modern researchers are unfamiliar with the script itself. OCR software also struggles with the special characters and fonts used in the 19th century and, as a result, produces incomprehensible text. A preliminary literature survey indicates that little research has been done on the subject, with the most recent and most relevant results published by researchers from the Science Academy of Moldova.
Through this project we will overcome the limits of current technologies by employing state-of-the-art open-source machine learning techniques capable of learning to recognize the symbols of the various Romanian transitional scripts. In addition, to ease the work of modern researchers, we will produce text documents transliterated into the modern Romanian Latin script, as well as an online service that scholars can use to automatically transliterate their 19th-century texts.
To do so, we will rely on the Tesseract software and use a diverse range of 19th-century texts to train an accurate AI model that recognizes characters in the transitional scripts. The model's validation and the text transliteration will be performed on a distinct set of 19th-century documents. The entire workflow will be automated and exposed as a demo web application validated in lab conditions.
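As a rough illustration of how such an automated workflow could be wired together, the Python sketch below chains a recognition step and a transliteration step over a batch of scanned pages. It is not the project's actual code. It assumes the third-party pytesseract wrapper and Pillow are installed, and the model name `ron_transitional` is a hypothetical placeholder for a custom-trained Tesseract model (Tesseract would look for a matching `.traineddata` file in its data directory). The function names `ocr_page` and `process_scans` are likewise invented for this example.

```python
from typing import Callable, Dict, Iterable


def ocr_page(image_path: str, model_name: str = "ron_transitional") -> str:
    """Recognize one scanned page with Tesseract.

    `model_name` is a hypothetical name for a custom-trained model.
    """
    # Imported lazily so the rest of the pipeline can be exercised
    # without the OCR dependencies installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as page:
        return pytesseract.image_to_string(page, lang=model_name)


def process_scans(
    image_paths: Iterable[str],
    recognize: Callable[[str], str] = ocr_page,
    transliterate: Callable[[str], str] = lambda text: text,
) -> Dict[str, str]:
    """Run recognition, then transliteration, on every scanned page.

    Both steps are passed in as functions, so either one can be
    replaced (for example by a stub during testing).
    """
    return {path: transliterate(recognize(path)) for path in image_paths}
```

Keeping the two steps pluggable is a design choice for this sketch only: it allows the recognition model and the transliteration rules to be validated separately on the held-out document set.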
The project aims to offer a working solution for transliterating (by phonetic transcription) the 19th-century Romanian Transitional Script (RTS) into the modern Latin alphabet.
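To make the idea of transliteration more concrete, here is a minimal Python sketch of a character-by-character mapping. The table is a deliberately small, simplified subset chosen for illustration and is not the project's transliteration table. Context-dependent rules are omitted (for example, к is written "ch" before e or i in modern orthography), as are the letters unique to the Romanian scripts. The names `CYRILLIC_TO_LATIN` and `transliterate` are assumptions made for this example.

```python
# Simplified, illustrative subset of Cyrillic-to-Latin correspondences.
# Context-dependent rules and script-specific letters are left out.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "j", "з": "z", "к": "c", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "ф": "f",
    "х": "h", "ц": "ț", "ш": "ș", "щ": "șt", "ъ": "ă",
}


def transliterate(text: str) -> str:
    """Replace mapped Cyrillic letters and keep everything else.

    Characters that are already Latin, punctuation, digits, and
    unmapped letters pass through unchanged, which suits texts that
    mix both alphabets.
    """
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


print(transliterate("шкоалъ"))  # -> școală
print(transliterate("Царъ"))    # -> Țară
```

A real solution would need rules that look at neighbouring letters and would have to handle the differences between the several versions of the transitional script.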
1st July 2022 - 30th June 2024
- A machine learning (ML) model for converting scanned images of Romanian texts written in the transitional script into text in the modern Latin script.
- A proof-of-concept online platform where scholars can upload their scanned documents and retrieve a transliterated document in MS Word format.