19th Century Romanian Transitional Alphabet Transliteration Project
Early 19th-century Romania saw the appearance and evolution of several transitional scripts that mixed Cyrillic and Latin characters, part of an effort by various scholars to migrate all texts to the Latin script by the second half of the century. As a result, many texts from that period were written in these scripts. While many of the regular Cyrillic characters appear in the Romanian versions, some characters are unique to them. Today these texts survive either as original manuscripts or as scanned documents in libraries across Romania. Studying them is difficult not only because the transitional script existed in many versions, but also because modern researchers are unfamiliar with the script itself. OCR software also struggles with the special characters and fonts used in the 19th century and, as a result, produces incomprehensible text. A preliminary literature survey indicates that little research has been done on the subject; the most recent and most relevant results were published by researchers from the Science Academy of Moldova.
Through this project we will overcome the limits of current technologies by employing state-of-the-art open-source machine learning techniques capable of learning to recognize the symbols of the various Romanian transitional scripts. In addition, to ease the work of modern researchers, we will produce text documents transliterated into the modern Romanian Latin script, as well as an online service that scholars can use to automatically transliterate their 19th-century texts.
For this we will rely on the Tesseract software and train, on a diverse range of 19th-century texts, an accurate AI model that can recognize characters in the transitional scripts. The model's validation and the text transliteration will be performed on a distinct set of 19th-century documents. The entire workflow will be automated and exposed as a demo web application validated in lab conditions.
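As an illustration of how recognition quality on the held-out validation documents could be quantified, the sketch below computes a character error rate (CER): the edit distance between a reference transcription and the OCR output, divided by the reference length. CER is a common OCR metric, but its use here is an assumption; the project description does not state which metric will be used, and the function names are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

A lower value is better; a value of 0.0 means the OCR output matches the reference transcription exactly.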
Aim
The project aims to offer a working solution for transliterating (by phonetic transcription) the 19th-century Romanian Transitional Script (RTS) into the modern Latin alphabet.
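A minimal sketch of what table-driven transliteration might look like is given below. It is not the project's method: the mapping `ILLUSTRATIVE_MAP` is a hypothetical, deliberately incomplete table covering only a few Cyrillic letters, and it works character by character, whereas a real phonetic transcription would also need context-dependent rules. Characters that are not in the table (including Latin letters already present in transitional texts) are passed through unchanged.

```python
# Hypothetical, incomplete mapping for illustration only; the real
# project would need a full table for each variant of the script.
ILLUSTRATIVE_MAP = {
    "щ": "șt",
    "ш": "ș",
    "ц": "ț",
    "ъ": "ă",
    "ж": "j",
    "б": "b",
    "в": "v",
    "д": "d",
    "л": "l",
    "н": "n",
    "п": "p",
}


def transliterate(text: str, table: dict = ILLUSTRATIVE_MAP) -> str:
    """Replace each mapped character, preserving initial capitalisation."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in table:
            latin = table[lower]
            # An uppercase source letter yields a capitalised replacement.
            out.append(latin.capitalize() if ch != lower else latin)
        else:
            # Unmapped characters are copied through untouched.
            out.append(ch)
    return "".join(out)
```

For example, `transliterate("шi")` maps the Cyrillic "ш" to "ș" and leaves the Latin "i" unchanged, giving "și".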
Duration
1st July 2022 - 30th June 2024
Expected outcomes
- A machine learning (ML) model for converting scanned images of Romanian texts written in the transitional script into text in the modern Latin script.
- A proof-of-concept online platform where scholars can upload their scanned documents and retrieve a transliterated document in MS Word format.
Dissemination
Publications
TBA