Tibetan Modern U-chen Print 0.1

Tibetan Modern U-chen Print 0.1 (TMUP 0.1) is the first Transkribus HTR model for printed Tibetan-language publications in U-chen (དབུ་ཅན་ dbu can) script. It was trained on texts published in the PRC between the 1950s and 1980s. The training data comprise 522 pages from 20 documents: 470 pages form the training set, and 52 automatically selected pages (10%) form the validation set. No base model was used. The model was developed by Franz Xaver Erhard (Leipzig University) and Xiaoying 笑影 (Leipzig University) for the Divergent Discourses project (DFG/AHRC).

The model is publicly available within the Transkribus environment. You can view and test the model at

https://readcoop.eu/model/tibetan-modern-u-chen-print/

Details on the model and the Ground Truth of the training set can be viewed on the Transkribus site of the Divergent Discourses project:

https://app.transkribus.org/sites/uchan

The training set for the model, consisting of the image files (jpg) and the corresponding Transkribus PageXML files, is available for download from:

Erhard, Franz Xaver, Xiaoying 笑影, Barnett, Robert, Hill, Nathan W., 2024. Tibetan Modern U-chen Print (TMUP) 0.1: Training Data for a Transkribus HTR Model for Modern Tibetan Printed Texts. https://doi.org/10.48796/20240313-0000352
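Since the training set consists of image files and corresponding PageXML files, a first step after downloading is usually to match each image to its transcription. The following is a minimal sketch, assuming that an image and its PageXML file share the same file stem (for example 0001.jpg and 0001.xml); this naming convention and the function name pair_by_stem are assumptions for illustration and are not documented by the dataset itself.

```python
from pathlib import Path


def pair_by_stem(paths):
    """Pair each .jpg with the .xml file that shares its file stem.

    Assumption: stems are unique within the given paths, so this is
    best run on one document folder at a time.
    """
    paths = [Path(p) for p in paths]
    # Index all XML files by stem; files such as mets.xml simply never match an image.
    xml_by_stem = {p.stem: p for p in paths if p.suffix.lower() == ".xml"}

    pairs, unmatched = [], []
    for img in sorted(p for p in paths if p.suffix.lower() in {".jpg", ".jpeg"}):
        xml = xml_by_stem.get(img.stem)
        if xml is None:
            unmatched.append(img)  # image without a transcription
        else:
            pairs.append((img, xml))
    return pairs, unmatched
```

For a downloaded document folder (the folder name here is hypothetical), the function could be called as pair_by_stem(Path("some_document_folder").rglob("*")). Any images returned in the second list have no matching PageXML file and should be checked by hand.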

Transkribus_utils: Paragraph Extractor: A tool to extract text from Transkribus PageXML

Transkribus transcribes the text on a given page line by line and does not distinguish between different kinds of formatting, such as headings, marginalia, or footnotes. Structuring and separating the transcribed text into smaller, meaningful units is crucial for text and corpus analysis, e.g., with the Leipzig Corpus Miner (iLCM). Retrieving structured plain text from Transkribus PageXML therefore requires further processing.

This repository holds utilities for parsing and extracting useful data from Transkribus PageXML outputs, such as a utility for identifying text regions (Paragraph Extractor) and a utility for reconciling Transkribus output metadata with the equivalent data in relevant library catalogues (coming shortly).

Engels, J., Barnett, R., Erhard, F. X., & Hill, N. (2024). Transkribus_utils: Paragraph Extractor (v1_Paragraph_Extractor). Zenodo. https://doi.org/10.5281/zenodo.10810509

View code on Diverge Github

Paragraph Extractor is a utility that accepts Transkribus PageXML as input and then interprets the text regions on each page/image (such as headers, titles, blocks of text, etc.), which we term “paragraphs”. It then returns the raw text of each text region (paragraph) along with its metadata. Note that it reads PageXML, not AltoXML.
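The general idea of reading text regions from PageXML can be illustrated with a short sketch. This is not the Paragraph Extractor code itself, only a minimal illustration using Python's standard library. It assumes the usual PAGE layout in which each TextRegion contains TextLine elements whose text sits in TextEquiv/Unicode, and it returns the region's type and custom attributes unparsed as metadata. The function name extract_regions and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened PageXML document used only for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="0001.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" custom="structure {type:heading;}">
      <TextLine id="r1l1"><TextEquiv><Unicode>heading line</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>heading line</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <TextLine id="r2l1"><TextEquiv><Unicode>body line 1</Unicode></TextEquiv></TextLine>
      <TextLine id="r2l2"><TextEquiv><Unicode>body line 2</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def extract_regions(page_xml):
    """Return one dict per TextRegion with its id, raw metadata attributes, and text."""
    root = ET.fromstring(page_xml)
    # Read the namespace from the root tag instead of hardcoding a schema version.
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""

    def q(tag):
        # Build a namespace-qualified tag name such as "{uri}TextRegion".
        return f"{{{ns}}}{tag}" if ns else tag

    regions = []
    for region in root.iter(q("TextRegion")):
        lines = []
        # Only direct TextLine children are read, so a region-level
        # TextEquiv summary is not counted a second time.
        for line in region.findall(q("TextLine")):
            unicode_el = line.find(f"{q('TextEquiv')}/{q('Unicode')}")
            if unicode_el is not None and unicode_el.text:
                lines.append(unicode_el.text)
        regions.append({
            "id": region.get("id"),
            "type": region.get("type"),            # may be None if not set
            "custom": region.get("custom", ""),    # left unparsed in this sketch
            "text": "\n".join(lines),
        })
    return regions


for region in extract_regions(SAMPLE):
    print(region["id"], repr(region["text"]))
```

To process an exported file, its contents can be passed in as bytes, for example extract_regions(Path("0001.xml").read_bytes()) with a hypothetical file name. Regions appear in document order here; a fuller tool would also need to decide how to handle reading order and nested regions.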

Paragraph Extractor was developed by James Engels of SOAS University of London for the Divergent Discourses project.