19th Century Romanian Transitional Alphabet Transliteration Project

In the early 19th century, several transitional scripts mixing Cyrillic and Latin characters appeared and evolved in Romania, as various scholars worked to migrate all texts to the Latin script by the second half of the century. As a result, many texts from that period were written in these scripts. While many of the regular Cyrillic characters were found in the Romanian version, some were unique to it.

Currently, these texts can be found either as original manuscripts or as scanned documents in libraries across Romania. Their study is difficult not only because there were many versions of the transitional script, but also because modern researchers are unfamiliar with the script itself. OCR software also struggles with the special characters and fonts employed in the 19th century and, as a result, produces incomprehensible text documents. A preliminary literature survey indicates that little research has been done on the subject, with the most recent and most relevant results published by researchers from the Science Academy of Moldova.

Through this project we will overcome the limits of current technologies by employing state-of-the-art open-source machine learning techniques capable of learning to recognize the symbols of the various Romanian transitional scripts. In addition, to ease the effort of modern researchers, we will produce text documents transliterated into the modern Romanian Latin script, as well as an online service that scholars can use to automatically transliterate their 19th-century texts. To that end, we will rely on the Tesseract software and train an accurate AI model on a diverse range of 19th-century texts to recognize characters in the transitional scripts. The model's validation and the text transliteration will be performed on a distinct set of 19th-century documents. The entire workflow will be automated and exposed as a demo web application validated in lab conditions.

Aim

The project aims to offer a working solution for transliterating (by phonetic transcription) the 19th-century Romanian Transitional Script (RTS) into the modern Latin alphabet.
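As a rough illustration of the transliteration step only (not the project's actual implementation), the following Python sketch maps a handful of Cyrillic letters to modern Romanian Latin letters while leaving characters that are already Latin untouched. The table SAMPLE_MAP and the function transliterate are hypothetical names introduced here; the mapping is deliberately partial, and a real table would need to cover each script variant and handle context-dependent letters.

```python
from typing import Mapping

# Illustrative, partial mapping (assumption): Cyrillic letters that could
# survive in a transitional text, with a modern Romanian Latin equivalent.
# A real table would be larger and context-sensitive.
SAMPLE_MAP = {
    "ш": "ș",
    "ц": "ț",
    "ж": "j",
    "ъ": "ă",
    "б": "b",
    "д": "d",
    "п": "p",
    "л": "l",
}


def transliterate(text: str, table: Mapping[str, str] = SAMPLE_MAP) -> str:
    """Replace mapped Cyrillic letters one by one, preserving letter case.

    Characters without an entry in the table (Latin letters, digits,
    punctuation) are copied through unchanged.
    """
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in table:
            repl = table[lower]
            # Re-apply capitalization if the source letter was upper case.
            out.append(repl.capitalize() if ch.isupper() else repl)
        else:
            out.append(ch)
    return "".join(out)
```

In a full pipeline, the input string would presumably come from the OCR stage rather than being typed by hand; a per-character table like this one is only a starting point, since phonetic transcription generally depends on neighbouring letters.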

Duration

1st July 2022 - 30th June 2024

Expected outcomes

Dissemination

Publications

  1. M. Frincu, S. Frincu, M. Penteliuc, Challenges and Solutions in Transliterating 19th Century Romanian Texts from the Transitional to the Latin Script, Procs. LDK 2023, pp. 226-231.

Technical reports

  1. Two annual technical reports (2022, 2023) have been submitted to the contracting authority.