19th Century Romanian Transitional Alphabet Transliteration Project

Early 19th-century Romania saw the appearance and evolution of several transitional scripts that mixed Cyrillic and Latin characters, as various scholars worked to migrate all texts to the Latin script by the second half of the century. As a result, many texts from that period were written in these scripts. While many of the regular Cyrillic characters appear in the Romanian versions, some characters were unique to them. Today, these texts survive either as original manuscripts or as scanned documents in libraries across Romania. Studying them is difficult not only because there were many versions of the transitional script, but also because modern researchers are unfamiliar with the script itself. OCR software also struggles with the special characters and fonts used in the 19th century and, as a result, produces incomprehensible text. A preliminary literature survey indicates that little research has been done on the subject; the most recent and relevant results were published by researchers from the Academy of Sciences of Moldova:

V. Demidova, L. Burtseva (2017), Digitizarea si prezentarea alfabetului chirilic romanesc de tranzitie (1830-1860), Akademos vol. 1.
S. Cojocaru, et al., On technology for digitization of Romanian historical heritage printed in the Cyrillic script, Procs. MFOI 2016.

This project aims to overcome the limits of current technologies by employing state-of-the-art open-source machine learning techniques capable of learning to recognize the symbols of the various Romanian transitional scripts. In addition, to ease the work of modern researchers, we will produce text documents transliterated into the modern Romanian Latin script, as well as an online service that scholars can use to automatically transliterate their 19th-century texts. To this end, we will rely on the Tesseract software and train an accurate AI model on a diverse range of 19th-century texts so that it can recognize characters in the transitional scripts. The model's validation and the text transliteration will be performed on a distinct set of 19th-century documents. The entire workflow will be automated and exposed as a demo web application validated in lab conditions.
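
The validation metric is not specified in this description. As a minimal sketch, assuming the recognized text is compared against a manually transcribed reference, a common choice for OCR evaluation is the character error rate (CER): the edit distance between the two strings divided by the length of the reference. The function names below are illustrative and are not taken from the project's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length (lower is better)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

A value of 0.0 means the OCR output matches the reference exactly; values can exceed 1.0 when the output is much longer than the reference.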

Aim

The aim of the project is to offer a working solution for transliterating (by phonetic transcription) the 19th-century Romanian Transitional Script (RTS) into the modern Latin alphabet.
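
To make the idea of transliteration by phonetic transcription concrete, here is a minimal sketch that treats it as a character-level lookup. The mapping is a small illustrative subset of commonly cited Romanian Cyrillic correspondences, not the project's actual table, and the names `ILLUSTRATIVE_MAP` and `transliterate` are hypothetical. Characters that are already Latin pass through unchanged, since the transitional scripts mix both alphabets.

```python
# Illustrative subset only (assumption): a real table would cover every
# symbol of each transitional script variant.
ILLUSTRATIVE_MAP = {
    "а": "a", "б": "b", "в": "v", "д": "d", "е": "e",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t",
    "ш": "ș", "ц": "ț", "щ": "șt", "ъ": "ă",
}


def transliterate(text: str, table: dict = ILLUSTRATIVE_MAP) -> str:
    """Replace each mapped Cyrillic character with its Latin equivalent,
    preserving case and leaving unmapped characters untouched."""
    out = []
    for ch in text:
        latin = table.get(ch.lower())
        if latin is None:
            out.append(ch)  # already Latin, punctuation, or not in the table
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


print(transliterate("Шапте"))  # Șapte
```

A one-to-one table like this is only a starting point: a complete solution would likely need context-dependent rules for letters whose Latin rendering depends on the neighbouring characters.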

Duration

1st July 2022 - 30th June 2024

Expected outcomes

A machine learning (ML) model for converting scanned images of Romanian texts written in the transitional script into text in the modern Latin script.
A proof-of-concept online platform where scholars can upload their scanned documents and retrieve a transliterated document in MS Word format.

Dissemination

Scientific Seminar Talk (Computer Science Dept. - West University of Timisoara, Romania), January 25, 2023.
Our abstract Enabling Interdisciplinary Learning and Research through AI-driven Transliteration of Historical Documents, by our team members Simina Frincu and Manuela Zanescu, has been accepted for presentation at the 21st Conference on New Directions in Humanities (2023) - Literary landscapes: Forms of knowledge in the Humanities.
Our paper Challenges and Solutions in Transliterating 19th Century Romanian Texts from the Transitional to the Latin Script has been accepted at the 2023 Conference on Language Data and Knowledge (LDK), which is included in the CORE list.
The dataset used for training our models is available on kaggle.com.
Our paper Comparing ML OCR Engines on Texts from 19th Century Written in the Romanian Transitional Script has been accepted at the 2023 IEEE Big Data Conference (CORE B).
Our paper Enabling Interdisciplinary Learning and Research Through AI Driven Transliteration of Historical Documents: A Case Study Proposal from Digital Humanities, by our team members Simina Frincu and Marc Frincu, has been accepted for publication in The International Journal of Humanities Education.
Our results were presented at the BCUT Workshop on the use of AI in libraries by Manuela Zanescu and Simina Frincu, April 25, 2024.

Publications

M. Frincu, S. Frincu, M. Penteliuc, Challenges and Solutions in Transliterating 19th Century Romanian Texts from the Transitional to the Latin Script, Procs. LDK 2023, pp. 226-231.
M. Frincu, M. Penteliuc, S. Frincu, M. Zanescu, G. Bran, Comparing ML OCR Engines on Texts from 19th Century Written in the Romanian Transitional Script, Procs. IEEE BigData 2023.
S. Frincu, M. Frincu, Enabling Interdisciplinary Learning and Research Through AI Driven Transliteration of Historical Documents: A Case Study Proposal from Digital Humanities, The International Journal of Humanities Education (accepted, in press).

Technical reports

Three annual technical reports (2022, 2023, 2024) and a final project report have been submitted to the contracting authority.