Transkribus – Divergent Discourses

CrossAsia Talk: From Print to Digital: Making Available Tibetan Newspapers as a Historical Source

Posted on 6. December 2024 by Xaver in Allgemein, Events, Outputs

Speaker: Franz Xaver Erhard

When: December 12, 2024 at 6 pm (CET)

Where: Online via Webex (Join now!)

The Sino-Tibetan history of the 1950s and 1960s is relatively unknown and highly contested. At the same time, sources on the period are scarce, and local archives – if they exist – are generally closed to outside researchers. The few existing collections of Tibetan newsprint and contemporary publications, including the one at the Staatsbibliothek zu Berlin, offer rare insights into the events and also into their official presentation at the very time they were taking place. The UK-German research project Divergent Discourses takes up this opportunity to study the events and narratives in newspapers of the period to understand how they became woven into cohesive yet diverging discourses on Tibet.

In the field of Tibetan Studies, Digital Humanities approaches are just emerging, and the essential tools are often still lacking. The Divergent Discourses project has grappled with a multitude of challenges to digitisation posed by the Tibetan language and script, the complexity of newspaper layout, and the lack of Natural Language Processing tools for Tibetan. It has therefore adapted existing tools or created new ones to build a workflow for the digitisation and analysis of a modern Tibetan text corpus.

The presentation will showcase the Divergent Discourses project’s approaches and Digital Humanities tools geared to unlock a large corpus of Tibetan historical newspapers for the first time as a source for a historical study of the emergence and development of conflicting concepts, ideas and discourse strategies.

The lecture will be held in English. If you have any questions, don’t hesitate to get in touch with the CrossAsia team. The event will be recorded.

The lecture will also be streamed via Webex. You can participate using your browser without installing any special software. Please click on “Join the talk now!” below, follow the link “join via browser” (“über Browser teilnehmen”), and enter your name.

Join the talk now!

Find out more on the CrossAsia blog

Talk: Text-mining propaganda: Studying Tibetan newspapers, 1950s to early 1960s

Posted on 4. November 2024 (updated 6. December 2024) by Xaver in Allgemein, Events, Outputs

Date: 21 November 2024
Time: 5:00 pm to 6:30 pm
Venue: Paul Webley Wing (Senate House), SOAS University of London
Room: Wolfson Lecture Theatre (SWLT)

Abstract

The 1950s and 1960s were pivotal periods both for the People’s Republic of China and on the Tibetan plateau, where major social and political change took place, including the absorption of Tibet into the new Chinese state and the flight of the Dalai Lama into exile.

Study of events in Tibet in this period is, however, particularly difficult because archival access is heavily restricted. As a result, historians have to turn largely to propaganda materials from that time, such as the newspapers produced by either side, as primary sources for the study of that period.

The Divergent Discourses project is trialling the use of newspapers from that time as historical sources and developing techniques for analysing their content. Working with such texts, however, presents numerous difficulties, ranging from distinguishing informational from polemical content to the challenges of digitising Tibetan language and script. In this presentation, we discuss the use of digital humanities tools to mine the corpus we have compiled of Tibetan-language newspapers produced within China and the exile Tibetan community in the 1950s and 1960s, and the prospects of discourse-tracing as a strategy for the historical study of that period.

Speakers

Dr Franz Xaver Erhard (Leipzig) is the PI of the German part of the Divergent Discourses project. He is a philologist specialising in modern Tibetan literature and early Tibetan-language newspapers. He obtained his PhD in Tibetology from Leipzig University and has taught Tibetan in Berlin, Oxford and Leipzig.

Dr Robert Barnett is the PI of the UK part of the Divergent Discourses project, funded by the AHRC. He works on nationality issues in China and modern Tibetan history, politics and culture and is a Professor and Senior Research Fellow at SOAS and an affiliate lecturer and research affiliate at King’s College London. Recent publications and edited volumes include Forceful Diplomacy (Turquoise Roof, 2024); Conflicted Memories with Benno Weiner and Françoise Robin (Brill, 2020); Tibetan Modernities: Notes from the Field with Ronald Schwartz (Brill, 2008); and Lhasa: Streets with Memories (Columbia, 2006).

Organiser

This event is co-hosted by the SOAS China Institute and the SOAS China and Inner Asia Section.

Tibetan Modern U-chen Print 0.1

Posted on 14. March 2024 (updated 4. June 2024) by Xaver in Datasets, Models

Tibetan Modern U-chen Print 0.1 (TMUP 0.1) is the first Transkribus HTR model for printed Tibetan-language publications in U-chen (དབུ་ཅན་ dbu can) script. It was trained on texts published in the PRC between the 1950s and 1980s, using 522 pages from 20 documents. The training set consists of 470 pages; the validation set consists of 52 automatically selected pages (10%). No base model was used. The model was developed by Franz Xaver Erhard (Leipzig University) and Xiaoying 笑影 (Leipzig University) for the Divergent Discourses project (DFG/AHRC).
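To make the page counts concrete, here is a minimal sketch of a 10% validation hold-out over 522 pages. How Transkribus selects validation pages internally is not described here; the random selection, the fixed seed, and the function name `split_pages` are assumptions made purely for illustration.

```python
import random


def split_pages(page_ids, validation_share=0.10, seed=0):
    """Hold out a share of pages for validation (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_val = round(len(page_ids) * validation_share)
    validation = set(rng.sample(page_ids, n_val))
    training = [p for p in page_ids if p not in validation]
    return training, sorted(validation)


# 522 pages in total, numbered 1..522 (placeholder identifiers)
pages = list(range(1, 523))
train, val = split_pages(pages)
print(len(train), len(val))  # 470 52
```

With 522 pages, a 10% share rounds to 52 validation pages and leaves 470 for training, which matches the figures reported for TMUP 0.1.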

The model is publicly available within the Transkribus environment. You can view and test the model at

https://readcoop.eu/model/tibetan-modern-u-chen-print/

Details on the model and the Ground Truth of the training set can be viewed on the Transkribus site of the Divergent Discourses project:

https://app.transkribus.org/sites/uchan

The training set for the model – consisting of the image files (jpg) and the corresponding Transkribus pageXML files – is available for download from:

Erhard, Franz Xaver, Xiaoying 笑影, Barnett, Robert, Hill, Nathan W., 2024. Tibetan Modern U-chen Print (TMUP) 0.1: Training Data for a Transkribus HTR Model for Modern Tibetan Printed Texts. https://doi.org/10.48796/20240313-000

Transkribus_utils: Paragraph Extractor: A tool to extract text from Transkribus pageXML

Posted on 13. March 2024 by Xaver in Code

Transkribus transcribes the text on a given page line by line and doesn’t discriminate between different forms of formatting such as headings, marginalia or footnotes. Meaningfully structuring and separating the transcribed text into smaller units is crucial for text and corpus analysis, e.g., with the Leipzig Corpus Miner (iLCM). To retrieve structured plain text from Transkribus pageXML, further processing is necessary.

This repository holds utilities for parsing and extracting useful data from Transkribus PageXML outputs, such as a utility for identifying text regions (Paragraph Extractor), and a utility to reconcile Transkribus output metadata with the equivalent data in relevant library catalogues (coming shortly).

Engels, J., Barnett, R., Erhard, F. X., & Hill, N. (2024). Transkribus_utils: Paragraph Extractor (v1_Paragraph_Extractor). Zenodo. https://doi.org/10.5281/zenodo.10810509

View code on Diverge Github

Paragraph Extractor is a utility that accepts Transkribus PageXML as input and then interprets the text regions on each page/image (such as headers, titles, blocks of text, etc.), which we term “paragraphs”. It then returns the raw text of each text region (paragraph) along with its metadata. Note that it reads PageXML, not AltoXML.
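The general idea can be sketched in a few lines of Python. This is not the Paragraph Extractor's actual code, only a minimal illustration of reading text regions from PageXML with the standard library. The sample document, the function name `extract_paragraphs`, and the assumption that the region type and reading order are stored in the `custom` attribute (as in typical Transkribus exports) are all illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Tiny hand-written PageXML sample (hypothetical content, not project data).
SAMPLE_PAGEXML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r2" custom="readingOrder {index:1;}">
      <TextLine id="r2l1"><TextEquiv><Unicode>First line</Unicode></TextEquiv></TextLine>
      <TextLine id="r2l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
    <TextRegion id="r1" custom="readingOrder {index:0;} structure {type:heading;}">
      <TextLine id="r1l1"><TextEquiv><Unicode>Headline</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def extract_paragraphs(pagexml):
    """Return one dict per TextRegion: id, type, reading order, raw text."""
    root = ET.fromstring(pagexml)
    paragraphs = []
    # "{*}" matches any namespace (Python 3.8+), so the schema version is not hard-coded.
    for region in root.findall(".//{*}TextRegion"):
        custom = region.get("custom", "")
        # Assumed Transkribus convention: "structure {type:heading;}" in @custom.
        kind = re.search(r"structure\s*\{[^}]*?type:\s*([^;}\s]+)", custom)
        order = re.search(r"readingOrder\s*\{[^}]*?index:\s*(\d+)", custom)
        lines = []
        for line in region.findall("{*}TextLine"):
            unicode_el = line.find("{*}TextEquiv/{*}Unicode")
            if unicode_el is not None and unicode_el.text:
                lines.append(unicode_el.text)
        paragraphs.append({
            "id": region.get("id"),
            "type": kind.group(1) if kind else None,
            "order": int(order.group(1)) if order else None,
            "text": "\n".join(lines),
        })
    # Regions without a reading order are placed last.
    paragraphs.sort(key=lambda p: p["order"] if p["order"] is not None else float("inf"))
    return paragraphs


for para in extract_paragraphs(SAMPLE_PAGEXML):
    print(para["id"], para["type"], repr(para["text"]))
```

Each returned dictionary pairs the raw text of a region with its metadata, so headings and body text can be filtered or exported separately before corpus analysis.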

Paragraph Extractor was developed by James Engels of SOAS University of London for the Divergent Discourses project.