New blog post: “Narratives, Newspapers and The Tibetan-China Dispute: The Divergent Discourses Project”

Since the interpretive turn in the social sciences some fifty years ago, the study of conflict, and of long-running disputes in particular, has given prominence to the role of narrative and discourse in the perpetuation of antagonisms. As Peter Coleman wrote in a much-cited essay in the 1990s, conflicts of all kinds are driven and sustained by stories: once “contradictory narratives emerge for each of the disputing groups and become promoted to unquestioned fact or truth,” he wrote, those disputes “often cross a threshold into intractability”. But how do such narratives emerge? How do they relate to the original events that triggered the dispute? And how much change do these narratives undergo in their early stages?

Divergent Discourses is a joint SOAS-Leipzig University project, funded by UK and German research bodies (the AHRC and DFG), that aims to explore these questions by studying the earliest accounts of the Sino-Tibetan conflict. That conflict began with the entry of China’s People’s Liberation Army into Tibet 74 years ago. At that time, the two parties to the dispute immediately turned to public media – primarily to newspapers – to convey their interpretations of events. By collecting and studying newspapers from the late 1950s and early 1960s, the project aims to trace the early formation of these accounts, which evolved into the deeply divergent narratives that have sustained the conflict to this day.

Read the full article by Robert Barnett and Franz Xaver Erhard on the SOAS China Institute’s blog …

Tibetan language support in iLCM

The following dataset consists of a Tibetan language model for SpaCy and a list of Tibetan stopwords to enable Tibetan language support in the integrated Leipzig Corpus Miner (iLCM); it includes instructions to upload the model into the iLCM framework.

Engels, James, Erhard, Franz Xaver, Barnett, Robert, & Hill, Nathan W. (2023). Tibetan for Spacy 1.1 (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10120779.

A significant obstacle to using major state-of-the-art NLP applications in Tibetan studies is the lack of support for the Tibetan language. Since the DIVERGE project aims to analyse several thousand pages of Tibetan newspapers, it depends on applications such as the integrated Leipzig Corpus Miner (iLCM).

The iLCM is an interface-based text and corpus mining software package capable of a range of NLP tasks, such as frequency analysis and topic modelling. It wraps functions that are available through SpaCy, so users do not have to write their own boilerplate scripts around a SpaCy model. As input, it requires a small language model built with the SpaCy NLP package. SpaCy has native support for several high- and high-intermediate-resource languages and is an industry-standard software package for small-scale NLP integration, in English and other European languages, in a variety of research and corporate environments. No major small language model NLP package available today has native support for Tibetan at any step of the pipeline.
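To make "frequency analysis" concrete, here is a minimal sketch in plain Python of the kind of counting such a tool performs once documents have been tokenized. It does not use the iLCM or SpaCy; the function name `term_frequencies` and the tiny sample corpus are assumptions made purely for illustration.

```python
from collections import Counter

def term_frequencies(docs):
    """Return (corpus-wide counts, document frequencies) for tokenized docs.

    `docs` is a list of token lists. Tokenization is assumed to have been
    done already, e.g. by a language model's tokenizer.
    """
    total = Counter()
    doc_freq = Counter()
    for tokens in docs:
        total.update(tokens)          # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts at most once
    return total, doc_freq

# Hypothetical, hand-made sample corpus of Tibetan syllables.
docs = [
    ["བོད", "རི", "བོད"],
    ["རི", "མཚོ"],
]
total, doc_freq = term_frequencies(docs)
for term, count in total.most_common():
    print(term, count, doc_freq[term])
```

Keeping corpus-wide counts and document frequencies apart is useful because a term that is frequent overall may still be concentrated in a single document.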

Because Tibetan is a low-resource language, little training data for a language model is available. Nevertheless, to get started, we have developed a preliminary Tibetan language model for SpaCy from the limited training data available in CoNLL-U format, preprocessed with BoTok.
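For readers unfamiliar with CoNLL-U, the sketch below reads the format with the Python standard library only. CoNLL-U stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `#` comment lines and a blank line between sentences. The sample annotation is invented for illustration and is not taken from the project's training data; the names `Token` and `parse_conllu` are likewise hypothetical.

```python
from typing import NamedTuple

class Token(NamedTuple):
    id: str
    form: str
    lemma: str
    upos: str

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of Token)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():           # blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):       # sentence-level comment/metadata
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue                   # skip multiword ranges and empty nodes
        current.append(Token(cols[0], cols[1], cols[2], cols[3]))
    if current:
        sentences.append(current)
    return sentences

def row(idx, form, upos):
    # Illustrative helper: columns after UPOS are left unspecified ("_").
    return "\t".join([idx, form, form, upos, "_", "_", "_", "_", "_", "_"])

# Invented sample, not linguistically vetted.
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    row("1", "བོད", "PROPN"),
    row("2", "ནི", "PART"),
    row("3", "ཡུལ", "NOUN"),
    row("4", "ཡིན", "AUX"),
    "",
    "# sent_id = demo-2",
    row("1", "ཡུལ", "NOUN"),
    row("2", "ཡིན", "AUX"),
    "",
])

sentences = parse_conllu(SAMPLE)
for sent in sentences:
    print([(t.form, t.upos) for t in sent])
```

SpaCy's command-line `convert` tool accepts CoNLL-U files as input for training, so a reader like this is mainly useful for inspecting or sanity-checking the data beforehand; whether the model described here was built that way is not stated in the text.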

The iLCM does not ship with a Tibetan model, nor does it natively support every language that SpaCy supports; it is packaged with a few common European languages (English, French, German, Italian, Spanish, and Portuguese). A new model must therefore be uploaded separately, together with a separate document containing an explicit list of stopwords (generally semantically vacuous function words that are not useful for tasks like topic modelling).
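The role of the stopword list can be sketched as follows. This is an illustration only: the file layout (one stopword per line, `#` for comments) is an assumption and not the documented iLCM format, the three sample stopwords are chosen by hand, and splitting on the tsheg (U+0F0B) and shad (U+0F0D) yields syllables rather than the word segmentation a tool such as BoTok provides.

```python
import re

TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan clause/sentence delimiter

def load_stopwords(text):
    """Read one stopword per line; skip blank lines and '#' comments."""
    words = set()
    for line in text.splitlines():
        line = line.strip().rstrip(TSHEG)
        if line and not line.startswith("#"):
            words.add(line)
    return words

def split_syllables(text):
    """Naive split on tsheg, shad and whitespace (not word segmentation)."""
    parts = re.split(f"[{TSHEG}{SHAD}\\s]+", text)
    return [p for p in parts if p]

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

# Hypothetical stopword file contents (assumed layout).
STOPWORD_FILE = "# sample function words\nནི\nདང\nགི\n"

stopwords = load_stopwords(STOPWORD_FILE)
tokens = split_syllables("བོད་ནི་རི་དང་མཚོ།")
content = remove_stopwords(tokens, stopwords)
print(tokens)
print(content)
```

Filtering function words before topic modelling keeps high-frequency particles from dominating every topic, which is why the list has to be supplied explicitly for a language the tool does not already know.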