Public data

Our project makes the following data publicly available to researchers and anyone interested in developing AI models for RTS:

A public dataset of 156 scanned images, together with ground truth text files (the text found in each image) and box files for training Tesseract models. The dataset can be used with our online text editor.
A pretrained Transkribus model, trained on the public dataset for 250 iterations, starting from a model for the Russian language. The model can be used on the Transkribus Lite platform.
A pretrained Tesseract 5.1 model, trained on the public dataset for 50,000 iterations, starting from a model for the Cyrillic script. The model can be used from any application that uses the Tesseract OCR API.
The set of rules for the phonetic transcription of Cyrillic characters into the modern Latin characters or groups of characters used in the Romanian language. (TBA)
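
As a rough illustration of how the pretrained Tesseract model listed above might be called from an application, the sketch below builds and runs a `tesseract` command line from Python. It assumes the Tesseract command-line tool is installed and that the downloaded model file has been saved as `rts.traineddata` in a local folder. The model name `rts`, the folder `./models`, and the image name `page.png` are hypothetical placeholders, not names published by the project.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, model_name, tessdata_dir):
    """Build the argument list for a Tesseract CLI call.

    model_name is the file name of the .traineddata file without its
    extension; "rts" in the usage example is a hypothetical placeholder.
    """
    return [
        "tesseract",
        str(image_path),      # input image
        "stdout",             # write recognized text to standard output
        "--tessdata-dir", str(tessdata_dir),  # folder holding <model_name>.traineddata
        "-l", model_name,     # model (language) to load
    ]


def recognize(image_path, model_name, tessdata_dir):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("The tesseract executable was not found on PATH.")
    command = build_tesseract_command(image_path, model_name, tessdata_dir)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever the image and model are stored.
    print(recognize("page.png", "rts", "./models"))
```

Calling the command-line tool keeps the sketch free of third-party Python packages; an application that already links against the Tesseract library or uses a wrapper would load the same `.traineddata` file through that interface instead.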