19th Century Romanian Transitional Alphabet Transliteration Project
Early 19th-century Romania saw the appearance and evolution of several transitional scripts that mixed Cyrillic and Latin characters, part of an effort by various scholars to migrate all texts to the Latin script by the second half of the century. As a result, many texts from that period were written in these scripts. While many of the regular Cyrillic characters appear in the Romanian versions, some characters were unique to them. Today, these texts survive either as original manuscripts or as scanned documents in libraries across Romania. Studying them is difficult not only because the transitional script existed in many versions, but also because modern researchers are unfamiliar with the script itself. OCR software also struggles with the special characters and fonts used in the 19th century and, as a result, produces incomprehensible text. A preliminary literature survey indicates that little research has been done on the subject, with the most recent and relevant results published by researchers from the Academy of Sciences of Moldova.
Through this project we will overcome the limits of current technologies by employing state-of-the-art open-source machine learning techniques capable of learning to recognize the symbols of the various Romanian transitional scripts. In addition, to ease the work of modern researchers, we will produce text documents transliterated into the modern Romanian Latin script, as well as an online service that scholars can use to automatically transliterate their 19th-century texts.
To do so, we will rely on the Tesseract software and train, on a diverse range of 19th-century texts, an accurate AI model that recognizes characters in the transitional scripts. Model validation and text transliteration will be performed on a distinct set of 19th-century documents. The entire workflow will be automated and exposed as a demo web application validated in lab conditions.
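The validation step on a held-out set of documents could be scored in several ways. The sketch below shows one common choice for OCR output, the character error rate (CER), computed as the edit distance between a reference transcription and the recognized text, divided by the reference length. The choice of CER and the function names are assumptions made for illustration; the project description does not state which metric is used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical helper: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, comparing the reference "țară" with an OCR output of "tara" gives two substitutions over four characters, so a CER of 0.5. A lower value means the recognized text is closer to the reference.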
Aim
The project aims to offer a working solution for transliterating (by phonetic transcription) the 19th-century Romanian Transitional Script (RTS) into the modern Latin alphabet.
Duration
1st July 2022 - 30th June 2024
Expected outcomes
- A machine learning (ML) model for converting scanned images of Romanian texts written in the transitional script into text using the modern Latin script.
- A proof-of-concept online platform where scholars can upload their scanned documents and retrieve a transliterated document in MS Word format.
Dissemination
Publications
- M. Frincu, S. Frincu, M. Penteliuc, Challenges and Solutions in Transliterating 19th Century Romanian Texts from the Transitional to the Latin Script, Procs. LDK 2023, pp. 226-231.
- M. Frincu, M. Penteliuc, S. Frincu, M. Zanescu, G. Bran, Comparing ML OCR Engines on Texts from 19th Century Written in the Romanian Transitional Script, Procs. IEEE BigData.
- S. Frincu, M. Frincu, Enabling Interdisciplinary Learning and Research Through AI Driven Transliteration of Historical Documents: A Case Study Proposal from Digital Humanities, The International Journal of Humanities Education (accepted in print).
Technical reports
- Three annual technical reports (2022, 2023, 2024) and a final project report have been submitted to the contracting authority.