Talk: Text-mining propaganda: Studying Tibetan newspapers, 1950s to early 1960s

Date: 21 November 2024
Time: 5:00 pm to 6:30 pm
Venue: Paul Webley Wing (Senate House), SOAS University of London
Room: Wolfson Lecture Theatre (SWLT)

Abstract

The 1950s and 1960s were pivotal periods both for the People’s Republic of China and on the Tibetan plateau, where major social and political change took place, including the absorption of Tibet into the new Chinese state and the flight of the Dalai Lama into exile. 

The study of events in Tibet in this period is, however, particularly difficult because archival access is heavily restricted. As a result, historians must rely largely on propaganda materials from the time, such as the newspapers produced by either side, as primary sources.

The Divergent Discourses project is trialling the use of newspapers from that time as historical sources and developing techniques for analysing their content. Working with such texts, however, presents numerous difficulties, ranging from distinguishing informational from polemical content to the challenges of digitising Tibetan language and script. In this presentation, we discuss the use of digital humanities tools to mine the corpus we have compiled of Tibetan-language newspapers produced within China and in the exile Tibetan community in the 1950s and 1960s, and the prospects of discourse-tracing as a strategy for the historical study of that period.

Speakers

Dr Franz Xaver Erhard (Leipzig) is the PI of the German part of the Divergent Discourses project. He is a philologist specialising in modern Tibetan literature and early Tibetan-language newspapers. He obtained his PhD in Tibetology from Leipzig University and has taught Tibetan in Berlin, Oxford and Leipzig.

Dr Robert Barnett is the PI of the UK part of the Divergent Discourses project, funded by the AHRC. He works on nationality issues in China and modern Tibetan history, politics and culture and is a Professor and Senior Research Fellow at SOAS and an affiliate lecturer and research affiliate at King’s College London. Recent publications and edited volumes include Forceful Diplomacy (Turquoise Roof, 2024); Conflicted Memories with Benno Weiner and Françoise Robin (Brill, 2020); Tibetan Modernities: Notes from the Field with Ronald Schwartz (Brill, 2008); and Lhasa: Streets with Memories (Columbia, 2006).

Organiser

This event is co-hosted by the SOAS China Institute and the SOAS China and Inner Asia Section.

Tibetan language support in iLCM

The following dataset consists of a Tibetan language model for SpaCy and a list of Tibetan stopwords to enable Tibetan language support in the integrated Leipzig Corpus Miner (iLCM); it includes instructions to upload the model into the iLCM framework.

Engels, James, Erhard, Franz Xaver, Barnett, Robert, & Hill, Nathan W. (2023). Tibetan for Spacy 1.1 (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10120779.

A significant obstacle to using major state-of-the-art NLP applications in Tibetan studies is the lack of support for the Tibetan language. Since the DIVERGE project aims to analyse several thousand pages of Tibetan newspapers, it depends on applications such as the integrated Leipzig Corpus Miner (iLCM).

The iLCM is an interface-based text and corpus mining software package capable of a range of NLP tasks, such as frequency analysis and topic modelling. It is a wrapper around functions accessible through SpaCy, but it spares the user the scripting that working with a bare SpaCy model requires. As input, it requires a small language model built with the SpaCy NLP package. SpaCy has native support for several high- and high-intermediate-resource languages and is an industry-standard software package for small-scale NLP integration, in English and other European languages, in a variety of research and corporate environments. No major small-language-model NLP package available today has native support for Tibetan at any step of the pipeline.

Because Tibetan is a low-resource language, little training data for a language model is available. Nevertheless, to get started, we have developed a preliminary Tibetan language model for SpaCy from the limited training data available in CoNLL-U format, preprocessed with BoTok.
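For readers unfamiliar with CoNLL-U, the following Python sketch shows the general shape of such training data and how token and part-of-speech pairs can be read from it. It is not the project's training pipeline: the function name read_conllu, the two-token sample and its tags are hypothetical and serve only to illustrate the ten-column, tab-separated layout of the format.

```python
from typing import List, Tuple

# Hypothetical two-token sample in CoNLL-U layout (10 tab-separated columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
# The tags are illustrative and not taken from the project's data.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tབོད\tབོད\tPROPN\t_\t_\t_\t_\t_\t_\n"
    "2\tཡིག\tཡིག\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)


def read_conllu(text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (FORM, UPOS) pairs per sentence."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skip them here.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        current.append((cols[1], cols[3]))
    if current:
        # Keep a final sentence that lacks a trailing blank line.
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        print(sentence)
```

Pairs of this kind are the sort of annotation a tagger is trained on. Turning them into a SpaCy model involves further steps that this sketch does not cover.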

The iLCM does not have a native Tibetan model and does not natively support every language supported by SpaCy, but it is packaged with a few common European languages (English, French, German, Italian, Spanish, and Portuguese). A new model must be uploaded separately, together with a separate document containing an explicit list of stopwords (generally semantically vacuous function words that are not useful for tasks like topic modelling).
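To illustrate why a stopword list matters for tasks such as frequency analysis, here is a minimal Python sketch that counts units of text after removing stopwords. Several things in it are assumptions made for illustration: the four-item STOPWORDS set is a hypothetical stand-in for the published list, the sample is a made-up string of syllables rather than a real sentence, and splitting on the tsheg yields syllables, not the word segmentation that a tool such as BoTok provides. The sketch does not reproduce the stopword file format expected by the iLCM.

```python
from collections import Counter
from typing import Iterable, List

TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan clause or sentence delimiter

# Hypothetical stand-in for a stopword list: a few common particles.
STOPWORDS = {"གི", "ནི", "དང", "ལ"}


def syllables(text: str) -> List[str]:
    """Naively split text into syllables on tsheg, shad and whitespace."""
    cleaned = text.replace(SHAD, " ")
    result: List[str] = []
    for chunk in cleaned.split():
        result.extend(s for s in chunk.split(TSHEG) if s)
    return result


def content_frequencies(text: str, stopwords: Iterable[str]) -> Counter:
    """Count syllables, leaving out anything in the stopword list."""
    stop = set(stopwords)
    return Counter(s for s in syllables(text) if s not in stop)


if __name__ == "__main__":
    # Made-up syllable string for demonstration only.
    sample = TSHEG.join(["བོད", "ནི", "བོད", "དང", "ཡིག"]) + SHAD
    for unit, count in content_frequencies(sample, STOPWORDS).most_common():
        print(unit, count)
```

Without the filter, particles of this kind would dominate any frequency table and crowd out the content-bearing terms that topic modelling relies on.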