Outputs & Events

Talk: Text-mining propaganda: Studying Tibetan newspapers, 1950s to early 1960s

Date 21 November 2024
Time 5:00 pm to 6:30 pm
Venue Paul Webley Wing (Senate House), SOAS University of London
Room Wolfson Lecture Theatre (SWLT)

Abstract

The 1950s and 1960s were pivotal periods both for the People’s Republic of China and on the Tibetan plateau, where major social and political change took place, including the absorption of Tibet into the new Chinese state and the flight of the Dalai Lama into exile. 

Study of events in Tibet in this period is, however, particularly difficult because archival access is heavily restricted. As a result, historians have to turn largely to propaganda materials from that time, such as the newspapers produced by either side, as primary sources for the study of that period. 

The Divergent Discourses project is trialling the use of newspapers from that time as historical sources and developing techniques for analysing their content. Working with such texts, however, presents numerous difficulties, ranging from distinguishing informational from polemical content to the challenges of digitising Tibetan language and script. In this presentation, we discuss the use of digital humanities tools to mine the corpus we have compiled of Tibetan-language newspapers produced within China and in the exile Tibetan community in the 1950s and 1960s, and the prospects of discourse-tracing as a strategy for the historical study of that period.

Speakers

Dr Franz Xaver Erhard (Leipzig) is the PI of the German part of the Divergent Discourses project. He is a philologist specialising in modern Tibetan literature and early Tibetan-language newspapers. He obtained his PhD in Tibetology from Leipzig University and has taught Tibetan in Berlin, Oxford and Leipzig.

Dr Robert Barnett is the PI of the UK part of the Divergent Discourses project, funded by the AHRC. He works on nationality issues in China and modern Tibetan history, politics and culture and is a Professor and Senior Research Fellow at SOAS and an affiliate lecturer and research affiliate at King’s College London. Recent publications and edited volumes include Forceful Diplomacy (Turquoise Roof, 2024); Conflicted Memories with Benno Weiner and Françoise Robin (Brill, 2020); Tibetan Modernities: Notes from the Field with Ronald Schwartz (Brill, 2008); and Lhasa: Streets with Memories (Columbia, 2006).

Organiser

This event is co-hosted by the SOAS China Institute and the SOAS China and Inner Asia Section.

Modern-Botok. Custom dictionary for modern Tibetan

Tsikchen.tsv is a customised dictionary to be integrated into the Tibetan tokenizer BoTok. BoTok can tokenize classical Tibetan text or traditional genres out of the box. However, since it depends on a dictionary for tokenization, it lacks capabilities for modern Tibetan, in particular, the language of modern newspapers published in the PRC or on the subcontinent. Adding this customised dictionary adds functionality for modern Tibetan to BoTok.
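To illustrate why dictionary coverage matters for tokenization, here is a minimal sketch of greedy longest-match segmentation over tsheg-delimited syllables. It is not BoTok's implementation, which is considerably more elaborate; the function name `segment`, the `max_len` parameter and the tiny word sets are hypothetical and chosen only for illustration.

```python
TSHEG = "\u0F0B"  # Tibetan syllable delimiter (tsheg)


def segment(text, lexicon, max_len=4):
    """Greedy longest-match word segmentation (illustrative only)."""
    syllables = [s for s in text.split(TSHEG) if s]
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable always matches,
        # so the loop is guaranteed to advance.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = TSHEG.join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical word lists: two modern compounds missing from the first set.
news = TSHEG.join(["གསར", "འགྱུར"])
paper = TSHEG.join(["ཚགས", "པར"])
without_modern = {"བོད"}
with_modern = without_modern | {news, paper}

text = TSHEG.join([news, paper])
print(len(segment(text, without_modern)))  # compounds fall apart into syllables
print(len(segment(text, with_modern)))     # compounds are kept as words
```

Without the modern entries, the segmenter can only fall back to single syllables; with them, the compounds survive as units.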

Erhard, F. Xaver & Kyogoku, Yuki. (2024). Modern-Botok. Custom dictionary for modern Tibetan (v0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14034747

The custom dictionary tsikchen was compiled from Christian Steinert’s collection and contains the following dictionaries:

  1. Grand Monlam Dictionary (default dictionary of BoTok)
  2. Jim Valby
  3. Ives Waldo
  4. Dan Martin
  5. Tshig mdzod chen mo
  6. Dung dkar
  7. Tibetan Terminology Project

The Divergent Discourses project cleaned up and edited the resulting dictionary to the project’s requirements (removal of duplicate entries, phraseologisms, ungrammatical entries, etc.; addition of ca. 1000 personal and place names).
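One of the clean-up steps named above, the removal of duplicate entries, could be sketched as follows. This is not the project's actual procedure: it assumes, purely for illustration, a tab-separated file whose first column holds the headword, and the sample rows are placeholders rather than real dictionary content.

```python
import csv
import io


def drop_duplicate_entries(rows):
    """Keep the first row seen for each headword (assumed to be column 0)."""
    seen = set()
    unique = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        headword = row[0].strip()
        if headword and headword not in seen:
            seen.add(headword)
            unique.append(row)
    return unique


# Placeholder data standing in for a merged dictionary file.
sample = "word_a\tnote 1\nword_b\tnote 2\nword_a\tnote 3\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
print(drop_duplicate_entries(rows))
```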

For its installation, see the Divergent Discourses’ modern-botok repository on GitHub.

New blog post: “Narratives, Newspapers and The Tibetan-China Dispute: The Divergent Discourses Project”

Since the interpretive turn in the social sciences some fifty years ago, the study of conflict, and of long-running disputes in particular, has given prominence to the role of narrative and discourse in the perpetuation of antagonisms. As Peter Coleman wrote in a much-cited essay in the 1990s, conflicts of all kinds are driven and sustained by stories: once “contradictory narratives emerge for each of the disputing groups and become promoted to unquestioned fact or truth,” he wrote, those disputes “often cross a threshold into intractability”. But how do such narratives emerge? How do they relate to the original events that triggered the dispute? And how much change do these narratives undergo in their early stages?

Divergent Discourses is a joint SOAS-Leipzig University project, funded by the UK and German research bodies (the AHRC and DFG), that aims to explore these questions by studying the earliest accounts of the Sino-Tibetan conflict. That conflict began with the entry of China’s People’s Liberation Army into Tibet 74 years ago. At that time, the two parties to the dispute immediately turned to public media – primarily to newspapers – to convey their interpretations of events. By collecting and studying newspapers from the late 1950s and early 1960s, the project aims to trace the early formation of these accounts, which evolved into the deeply divergent narratives that have sustained the conflict till today.

Read the full article by Robert Barnett and Franz Xaver Erhard on the SOAS China Institute’s blog …

Tibetan Modern U-chen Print 0.1

Tibetan Modern U-chen Print 0.1 (TMUP 0.1) is the first Transkribus HTR model for printed Tibetan language publications in Uchen (དབུ་ཅན་ dbu can) script. It has been trained on texts that were published in the PRC between the 1950s and 1980s. The model was trained on 522 pages in 20 documents. The training set consists of 470 pages; the validation set consists of 52 (10%) automatically selected pages. No base model was used. The model was developed by Franz Xaver Erhard (Leipzig University) and Xiaoying 笑影 (Leipzig University) for the Divergent Discourses project (DFG/AHRC).
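The page counts above are consistent with a 10% hold-out of 522 pages. The following sketch reproduces the arithmetic with a random split; Transkribus performs its own automatic selection, so the function `split_pages`, the seed and the page numbering are assumptions made for illustration.

```python
import random


def split_pages(page_ids, validation_share=0.10, seed=0):
    """Randomly hold out a share of pages for validation (illustrative)."""
    rng = random.Random(seed)
    n_validation = int(len(page_ids) * validation_share)
    validation = set(rng.sample(page_ids, n_validation))
    training = [p for p in page_ids if p not in validation]
    return training, sorted(validation)


training, validation = split_pages(list(range(1, 523)))
print(len(training), len(validation))  # 470 52
```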

The model is publicly available within the Transkribus environment. You can view and test the model at

https://readcoop.eu/model/tibetan-modern-u-chen-print/

Details on the model and the Ground Truth of the training set can be viewed on the Transkribus-Site of the Divergent Discourses project:

https://app.transkribus.org/sites/uchan

The training set for the model – consisting of the image files (jpg) and the corresponding Transkribus pageXML files – is available for download from:

Erhard, Franz Xaver, Xiaoying 笑影, Barnett, Robert, Hill, Nathan W., 2024. Tibetan Modern U-chen Print (TMUP) 0.1: Training Data for a Transkribus HTR Model for Modern Tibetan Printed Texts. https://doi.org/10.48796/20240313-000

Tibetan_tokenizers: botok_tokenizer.py

botok_tokenizer.py is a tokenization or word segmentation utility for Tibetan based on BoTok, a tokenizer developed by OpenPecha. It allows you to point to a whole folder or directory or to a single .txt file. It selects just the tokenizer element of BoTok rather than BoTok’s POS tagger element, which we have not included in this utility.
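The "folder or single file" behaviour described above can be sketched with the standard library alone. This is not the utility's own code: the helper name `collect_text_files` is hypothetical, and the choice not to descend into subfolders is an assumption.

```python
from pathlib import Path


def collect_text_files(target):
    """Return the .txt files named by a path to a folder or a single file."""
    path = Path(target)
    if path.is_dir():
        # Assumption: only files directly inside the folder are collected.
        return sorted(p for p in path.iterdir() if p.suffix == ".txt")
    if path.is_file() and path.suffix == ".txt":
        return [path]
    raise ValueError(f"Expected a folder or a .txt file, got: {target}")
```

Each returned path could then be read and passed to the tokenizer in turn.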

Engels, J., Barnett, R., Erhard, F., & Hill, N. (2024). Tibetan_tokenizers: botok_tokenizer.py (v1.1). Zenodo. https://doi.org/10.5281/zenodo.10810709

View code on Divergent Discourses GitHub

botok_tokenizer.py was developed by James Engels of SOAS University of London for the Divergent Discourses project. The project is a joint study involving SOAS University of London and Leipzig University, funded by the AHRC in the UK and the DFG in Germany.  Please acknowledge the project in any use of these materials. Copyright for the project resides with the two universities.

Transkribus_utils: Paragraph Extractor: A tool to extract text from Transkribus pageXML

Transkribus transcribes the text on a given page line by line and doesn’t discriminate between different forms of formatting such as headings, marginalia or footnotes. Structuring the transcribed text and separating it into smaller units is crucial for text and corpus analysis, e.g., with the Leipzig Corpus Miner (iLCM). To retrieve structured plain text from Transkribus pageXML, further processing is necessary.

This repository holds utilities for parsing and extracting useful data from Transkribus PageXML outputs, such as a utility for identifying text regions (Paragraph Extractor), and a utility to reconcile Transkribus output metadata with the equivalent data in relevant library catalogues (coming shortly).

Engels, J., Robert Barnett, Erhard, F. X., & Hill, N. (2024). Transkribus_utils: Paragraph Extractor (v1_Paragraph_Extractor). Zenodo. https://doi.org/10.5281/zenodo.10810509

View code on Diverge GitHub

Paragraph Extractor is a utility that accepts Transkribus PageXML as input and then interprets the text regions on each page/image (such as headers, titles, blocks of text, etc.), which we term “paragraphs”. It then returns the raw text of each text region (paragraph) along with its metadata. Note that it reads PageXML, not AltoXML.
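The general idea of reading text regions from PageXML can be sketched as follows. This is a minimal illustration and not Paragraph Extractor itself: the function name `extract_regions`, the returned dictionary keys and the sample document (including its file name and region types) are assumptions, and real Transkribus exports may record region types differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced PageXML document used only for this example.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.jpg">
    <TextRegion id="r1" type="heading">
      <TextLine id="r1l1"><TextEquiv><Unicode>Headline</Unicode></TextEquiv></TextLine>
    </TextRegion>
    <TextRegion id="r2" type="paragraph">
      <TextLine id="r2l1"><TextEquiv><Unicode>First line</Unicode></TextEquiv></TextLine>
      <TextLine id="r2l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def extract_regions(page_xml):
    """Return one dict per TextRegion with its id, type and joined line text."""
    root = ET.fromstring(page_xml)
    regions = []
    # "{*}" matches any namespace (Python 3.8+), so the schema version is not hard-coded.
    for region in root.iterfind(".//{*}TextRegion"):
        lines = []
        for line in region.iterfind("{*}TextLine"):
            unicode_el = line.find("{*}TextEquiv/{*}Unicode")
            if unicode_el is not None and unicode_el.text:
                lines.append(unicode_el.text)
        regions.append({
            "id": region.get("id"),
            "type": region.get("type"),
            "text": "\n".join(lines),
        })
    return regions


for region in extract_regions(SAMPLE):
    print(region["id"], region["type"], repr(region["text"]))
```

Grouping lines by their parent region is what turns the line-by-line transcription into paragraph-sized units suitable for corpus analysis.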

Paragraph Extractor was developed by James Engels of SOAS University of London for the Divergent Discourses project.

TibNorm: Script to Normalise Tibetan Text

TibNorm is a utility for producing normalised versions of Tibetan texts to make them easier for contemporary users to search and read, in line with current Tibetan writing conventions.

Kyogoku, Yuki, Robbie Barnett, & Franz Xaver Erhard. (2024). TibNorm – Normaliser for Tibetan (Version v1). Zenodo. https://doi.org/10.5281/zenodo.10806456

See code on Diverge GitHub

As part of the normalisation process, TibNorm:

  • changes Tibetan numbers into Arabic numerals
  • changes Tibetan brackets and quotation marks into the standard western equivalents
  • removes a ། if found after a ཀ, ག or ཤ, with or without a vowel
  • adds a ་ between ང and །
  • reduces two or more ་ to a single one
  • changes ཌ་ or ཊ་ to གས་ unless preceded by a white space, tab, or new line
  • changes non-standard “illegal” stacks into standard ones
  • deletes a ། if found at the beginning of a line
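A few of the rules listed above can be expressed as simple string and regular-expression substitutions. The sketch below covers only four of them (numerals, repeated ་, ་ between ང and །, and a line-initial །); it is not TibNorm's code, and the function name `normalise` and the order in which the rules are applied are assumptions.

```python
import re

TSHEG = "\u0F0B"  # ་ syllable delimiter
SHAD = "\u0F0D"   # ། clause and sentence mark
NGA = "\u0F44"    # ང

# Tibetan digits U+0F20..U+0F29 mapped to the Arabic numerals 0-9.
DIGITS = str.maketrans({chr(0x0F20 + i): str(i) for i in range(10)})


def normalise(text):
    """Apply a small subset of the normalisation rules (illustrative only)."""
    text = text.translate(DIGITS)                          # Tibetan to Arabic numerals
    text = re.sub(f"{TSHEG}{{2,}}", TSHEG, text)           # collapse repeated tsheg
    text = text.replace(NGA + SHAD, NGA + TSHEG + SHAD)    # tsheg between nga and shad
    text = re.sub(f"^{SHAD}", "", text, flags=re.MULTILINE)  # drop line-initial shad
    return text


print(normalise("\u0F21\u0F29\u0F25\u0F29"))  # 1959
```

The remaining rules (brackets and quotation marks, ཌ་ and ཊ་, non-standard stacks, ། after ཀ, ག or ཤ) need more context-sensitive handling and are left out here.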

TibNorm also expands abbreviations so that they are shown in their full form. For abbreviations in classical Tibetan, TibNorm draws from the list of over 6,000 classical Tibetan abbreviations compiled by Bruno Lainé of the Tibetan Manuscript Project Vienna (TMPV) as part of the project’s Resources for Kanjur and Tanjur Studies. In TibNorm, the user can manually change the flag in the abbreviations table to exclude any abbreviation that they don’t want to expand.
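The flag mechanism described above could work along these lines. This is a hedged sketch rather than TibNorm's implementation: the table layout (abbreviation, expansion, flag), the function name `expand_abbreviations` and the single sample entry are assumptions for illustration, and the entry is not taken from the TMPV list.

```python
TSHEG = "\u0F0B"  # Tibetan syllable delimiter (tsheg)

# Assumed layout: (abbreviation, full form, expand?). Illustrative entry only.
ABBREVIATIONS = [
    ("ཐམད", TSHEG.join(["ཐམས", "ཅད"]), True),
]


def expand_abbreviations(text, table):
    """Replace whole syllables that match an enabled abbreviation."""
    active = {abbr: full for abbr, full, enabled in table if enabled}
    # Matching whole syllables avoids rewriting parts of longer syllables.
    return TSHEG.join(active.get(s, s) for s in text.split(TSHEG))


text = TSHEG.join(["བོད", "ཐམད"])
print(expand_abbreviations(text, ABBREVIATIONS))
```

Setting the third field of an entry to `False` leaves that abbreviation untouched, which mirrors the manual exclusion described above.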

TibNorm was developed for the Divergent Discourses project by Yuki Kyogoku of Leipzig University.

Tibetan language support in iLCM

The following dataset consists of a Tibetan language model for SpaCy and a list of Tibetan stopwords to enable Tibetan language support in the integrated Leipzig Corpus Miner (iLCM); it includes instructions to upload the model into the iLCM framework.

Engels, James, Erhard, Franz Xaver, Barnett, Robert, & Hill, Nathan W. (2023). Tibetan for Spacy 1.1 (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10120779.

A significant obstacle to using major state-of-the-art NLP applications in Tibetan studies is the lack of support for the Tibetan language. Since the DIVERGE project aims to analyse several thousand pages of Tibetan newspapers, it depends on applications such as the integrated Leipzig Corpus Miner (iLCM).

The iLCM is an interface-based text and corpus mining software package capable of a range of NLP tasks, such as frequency analysis and topic modelling. The iLCM is a wrapper around functions accessible through SpaCy but avoids the scripting requirement of a boilerplate SpaCy model. It requires as input a small language model built with the SpaCy NLP package. SpaCy has native support for several high- and high-intermediate-resource languages and is an industry-standard software package for small-scale English and other European-language NLP integration in a variety of research and corporate environments. No major small language model NLP package available today has native support for Tibetan at any step of the pipeline.

As Tibetan is a low-resource language, little training data for a language model is available. Nevertheless, to get started, we have developed a preliminary Tibetan language model for SpaCy from the limited training data available in CoNLL-U format, preprocessed with BoTok.

The iLCM does not have a native Tibetan model and does not natively support every language supported by SpaCy but is packaged with a few common European languages (English, French, German, Italian, Spanish, and Portuguese). A new model must be separately uploaded with a separate document including an explicit list of stopwords (generally semantically vacuous function words that are not useful for tasks like topic modelling).
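To show what an explicit stopword list contributes to tasks such as frequency analysis, here is a small sketch in plain Python. It does not use iLCM or SpaCy; the function name `content_word_frequencies`, the five-item stopword set and the token list are illustrative assumptions, not the list distributed with the dataset.

```python
from collections import Counter

TSHEG = "\u0F0B"  # Tibetan syllable delimiter (tsheg)

# Illustrative subset of function words; the published list is longer.
STOPWORDS = {"གི", "ཀྱི", "དང", "ནི", "ལ"}


def content_word_frequencies(tokens, stopwords):
    """Count tokens after discarding stopwords."""
    return Counter(t for t in tokens if t not in stopwords)


# Hypothetical tokenizer output for a short phrase.
tokens = ["བོད", "ཀྱི", TSHEG.join(["གསར", "འགྱུར"]), "དང", "བོད"]
print(content_word_frequencies(tokens, STOPWORDS).most_common())
```

Without the filter, high-frequency particles would dominate the counts and obscure the content words that topic modelling depends on.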