Tibetan Modern U-chen Print 0.1

Tibetan Modern U-chen Print 0.1 (TMUP 0.1) is the first Transkribus HTR model for printed Tibetan-language publications in U-chen (དབུ་ཅན་ dbu can) script. It was trained on 522 pages from 20 documents published in the PRC between the 1950s and 1980s. The training set consists of 470 pages; the validation set consists of 52 automatically selected pages (10%). No base model was used. The model was developed by Franz Xaver Erhard (Leipzig University) and Xiaoying 笑影 (Leipzig University) for the Divergent Discourses project (DFG/AHRC).

The model is publicly available within the Transkribus environment. You can view and test the model at:

https://readcoop.eu/model/tibetan-modern-u-chen-print/

Details on the model and the ground truth of the training set can be viewed on the Transkribus site of the Divergent Discourses project:

https://app.transkribus.org/sites/uchan

The training data for the model – consisting of the image files (JPG) and the corresponding Transkribus PAGE XML files – is available for download from:

Erhard, Franz Xaver, Xiaoying 笑影, Barnett, Robert, Hill, Nathan W., 2024. Tibetan Modern U-chen Print (TMUP) 0.1: Training Data for a Transkribus HTR Model for Modern Tibetan Printed Texts. https://doi.org/10.48796/20240313-0000352

Tibetan language support in iLCM

The following dataset consists of a Tibetan language model for SpaCy and a list of Tibetan stopwords to enable Tibetan language support in the integrated Leipzig Corpus Miner (iLCM); it includes instructions to upload the model into the iLCM framework.

Engels, James, Erhard, Franz Xaver, Barnett, Robert, & Hill, Nathan W. (2023). Tibetan for Spacy 1.1 (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10120779.

A significant obstacle to using major state-of-the-art NLP applications in Tibetan studies is the lack of support for the Tibetan language. Since the DIVERGE project aims to analyse several thousand pages of Tibetan newspapers, it depends on applications such as the integrated Leipzig Corpus Miner (iLCM).

The iLCM is an interface-based text and corpus mining software package capable of a range of NLP tasks, such as frequency analysis and topic modelling. It wraps functions accessible through SpaCy while avoiding the scripting otherwise required to work with a SpaCy model directly. As input, it requires a small language model built with the SpaCy NLP package. SpaCy has native support for several high- and intermediate-resource languages and is an industry-standard package for small-scale English- and other European-language NLP in a variety of research and corporate environments. However, no major small-language-model NLP package available today has native support for Tibetan at any step of the pipeline.

As a low-resource language, Tibetan has only little training data available for building a language model. Nevertheless, to get started, we developed a preliminary Tibetan language model for SpaCy from the limited training data available in CoNLL-U format, preprocessing it with BoTok.
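To illustrate the kind of input involved, the sketch below reads token and part-of-speech pairs from CoNLL-U formatted text using only the Python standard library. The Tibetan sentence and its annotations are illustrative examples, not taken from the project's actual training data.

```python
# Minimal sketch: extracting (token, UPOS) pairs from CoNLL-U data.
# The sample sentence below is illustrative only.
conllu_sample = """\
# text = བོད་ཡིག
1\tབོད་\t_\tPROPN\t_\t_\t0\t_\t_\t_
2\tཡིག\t_\tNOUN\t_\t_\t1\t_\t_\t_
"""

def read_conllu(text):
    """Yield (form, upos) pairs, skipping comment and blank lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A regular token line has 10 columns and an integer ID
        # (multiword-token ranges like "1-2" are skipped).
        if len(cols) == 10 and cols[0].isdigit():
            yield cols[1], cols[3]

pairs = list(read_conllu(conllu_sample))
# pairs is now [("བོད་", "PROPN"), ("ཡིག", "NOUN")]
```

In practice, such pairs would be converted into SpaCy's training format rather than consumed directly.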

The iLCM does not include a Tibetan model and does not natively support every language supported by SpaCy; it is packaged with a few common European languages (English, French, German, Italian, Spanish, and Portuguese). A new model must therefore be uploaded separately, together with a document containing an explicit list of stopwords (generally semantically vacuous function words that are not useful for tasks like topic modelling).
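A stopword list of this kind is typically a plain-text file with one word per line. The sketch below shows, under that assumption, how such a list would be applied to filter tokens before a task like topic modelling; the particles used here (གི, གིས, ནི) are common Tibetan function words chosen for illustration, not an excerpt from the project's stopword list.

```python
# Minimal sketch: applying a one-word-per-line stopword list.
# The stopwords and tokens below are illustrative examples.
stopwords_text = "གི\nགིས\nནི\n"          # contents of a plain-text stopword file
stopwords = set(stopwords_text.split())   # split() also drops blank lines

tokens = ["བོད", "གི", "སྐད", "ནི"]
content_tokens = [t for t in tokens if t not in stopwords]
# content_tokens keeps only the content words: ["བོད", "སྐད"]
```

Filtering out such function words reduces noise in frequency counts and topic models, where they would otherwise dominate without carrying topical meaning.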