Tibetan_tokenizers: botok_tokenizer.py

botok_tokenizer.py is a word segmentation (tokenization) utility for Tibetan based on BoTok, a tokenizer developed by OpenPecha. It can be pointed at either a whole directory or a single .txt file. It uses only the tokenizer component of BoTok; BoTok’s POS tagger is not included in this utility.
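The directory-or-file behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code of botok_tokenizer.py: the function names `collect_txt_files`, `make_botok_tokenizer` and `tokenize_path` are hypothetical, and the only BoTok calls assumed are `WordTokenizer().tokenize(...)` and the `text` attribute of the returned tokens.

```python
from pathlib import Path


def collect_txt_files(target):
    """Return the .txt files named by target (a single file or a directory)."""
    path = Path(target)
    if path.is_file():
        return [path] if path.suffix.lower() == ".txt" else []
    return sorted(path.glob("*.txt"))


def make_botok_tokenizer():
    """Build a text -> list-of-words callable backed by BoTok (requires botok)."""
    # Imported lazily so the file handling above works without botok installed.
    from botok import WordTokenizer

    wt = WordTokenizer()
    # Keep only the surface form of each token; POS information is ignored.
    return lambda text: [tok.text for tok in wt.tokenize(text, split_affixes=False)]


def tokenize_path(target, tokenize):
    """Apply the tokenize callable to every .txt file found under target."""
    return {
        path.name: tokenize(path.read_text(encoding="utf-8"))
        for path in collect_txt_files(target)
    }
```

With botok installed, a call such as `tokenize_path("my_corpus", make_botok_tokenizer())` would return a dictionary mapping each file name to its list of word tokens; `my_corpus` is a placeholder path.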

Engels, J., Barnett, R., Erhard, F., & Hill, N. (2024). Tibetan_tokenizers: botok_tokenizer.py (v1.1). Zenodo. https://doi.org/10.5281/zenodo.10810709

View code on Divergent Discourses GitHub

botok_tokenizer.py was developed by James Engels of SOAS University of London for the Divergent Discourses project. The project is a joint study involving SOAS University of London and Leipzig University, funded by the AHRC in the UK and the DFG in Germany. Please acknowledge the project in any use of these materials. Copyright for the project resides with the two universities.

Transkribus_utils: Paragraph Extractor: A tool to extract text from Transkribus PageXML

Transkribus transcribes the text on a given page line by line and does not distinguish between layout elements such as headings, marginalia or footnotes. Structuring and separating the transcribed text into smaller units is crucial for text and corpus analysis, e.g., with the Leipzig Corpus Miner (iLCM). Retrieving structured plain text from Transkribus PageXML therefore requires further processing.

This repository holds utilities for parsing and extracting useful data from Transkribus PageXML outputs, such as a utility for identifying text regions (Paragraph Extractor) and a utility for reconciling Transkribus output metadata with the equivalent data in relevant library catalogues (coming shortly).

Engels, J., Barnett, R., Erhard, F. X., & Hill, N. (2024). Transkribus_utils: Paragraph Extractor (v1_Paragraph_Extractor). Zenodo. https://doi.org/10.5281/zenodo.10810509

View code on Diverge GitHub

Paragraph Extractor is a utility that accepts Transkribus PageXML as input and then interprets the text regions on each page/image (such as headers, titles, blocks of text, etc.), which we term “paragraphs”. It then returns the raw text of each text region (paragraph) along with its metadata. Note that it reads PageXML, not AltoXML.
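As a rough illustration of this kind of extraction (not the Paragraph Extractor code itself), the sketch below reads one PageXML document with Python's standard library and returns the text of each text region with some of its attributes. The function name `extract_regions` and the shape of the returned dictionaries are assumptions; the sketch also assumes a namespaced PageXML file in which each `TextLine` carries its text in `TextEquiv/Unicode`.

```python
import xml.etree.ElementTree as ET


def extract_regions(page_xml):
    """Return one dict per TextRegion found in a PageXML string or bytes object."""
    root = ET.fromstring(page_xml)
    # PageXML elements are namespaced; reuse the namespace declared on the root
    # so that different schema versions are handled the same way.
    ns = {"pc": root.tag[1:root.tag.index("}")]}

    regions = []
    for page in root.iterfind("pc:Page", ns):
        for region in page.iterfind("pc:TextRegion", ns):
            # Join the transcribed lines of the region in document order.
            lines = [
                line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=ns)
                for line in region.iterfind("pc:TextLine", ns)
            ]
            regions.append(
                {
                    "image": page.get("imageFilename"),
                    "id": region.get("id"),
                    # Transkribus typically records structure tags in `custom`.
                    "custom": region.get("custom", ""),
                    "text": "\n".join(lines),
                }
            )
    return regions
```

For files on disk, reading the content as bytes (for example `Path(name).read_bytes()`) and passing it to the function avoids problems with XML encoding declarations.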

Paragraph Extractor was developed by James Engels of SOAS University of London for the Divergent Discourses project.

TibNorm: Script to Normalise Tibetan Text

TibNorm is a utility for producing normalised versions of Tibetan texts to make them easier for contemporary users to search and read, in line with current Tibetan writing conventions.

Kyogoku, Yuki, Robbie Barnett, & Franz Xaver Erhard. (2024). TibNorm – Normaliser for Tibetan (Version v1). Zenodo. https://doi.org/10.5281/zenodo.10806456

View code on Diverge GitHub

As part of the normalisation process, TibNorm:

  • changes Tibetan numbers into Arabic numerals
  • changes Tibetan brackets and quotation marks into the standard western equivalents
  • removes a ། if found after a ཀ, ག or ཤ, with or without a vowel
  • adds a ་ between ང and །
  • reduces two or more ་ to a single one
  • changes ཌ་ or ཊ་ to གས་ unless preceded by a white space, tab, or new line
  • changes non-standard “illegal” stacks into standard ones
  • deletes a ། if found at the beginning of a line
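
Several of the rules listed above can be expressed as simple character mappings and regular expressions. The following is a minimal sketch under stated assumptions, not TibNorm's implementation: the function name `normalise`, the order in which the rules are applied, the choice of ༼ ༽ as the only brackets handled, and the set of four vowel signs are all assumptions. The rules for ཌ་/ཊ་, quotation marks and non-standard stacks are left out.

```python
import re

# Tibetan digits U+0F20..U+0F29 mapped to the Arabic numerals 0..9.
DIGITS = {0x0F20 + i: str(i) for i in range(10)}
# Assumed bracket mapping (U+0F3C, U+0F3D); the real tool may cover more characters.
BRACKETS = {0x0F3C: "(", 0x0F3D: ")"}

TSHEG = "\u0F0B"  # the syllable delimiter
SHAD = "\u0F0D"   # the clause or sentence delimiter
VOWELS = "\u0F72\u0F74\u0F7A\u0F7C"  # vowel signs i, u, e, o
KA_GA_SHA = "\u0F40\u0F42\u0F64"     # the letters ka, ga, sha
NGA = "\u0F44"                        # the letter nga


def normalise(text):
    """Apply a subset of the normalisation rules to a string."""
    # Digits and brackets are one-to-one character substitutions.
    text = text.translate({**DIGITS, **BRACKETS})
    # Reduce runs of two or more tsheg to a single one.
    text = re.sub(f"{TSHEG}{{2,}}", TSHEG, text)
    # Delete a shad at the beginning of a line.
    text = re.sub(f"^{SHAD}", "", text, flags=re.MULTILINE)
    # Remove a shad directly after ka, ga or sha, with or without a vowel sign.
    text = re.sub(f"([{KA_GA_SHA}][{VOWELS}]?){SHAD}", r"\1", text)
    # Insert a tsheg between nga and a following shad.
    text = re.sub(f"{NGA}{SHAD}", f"{NGA}{TSHEG}{SHAD}", text)
    return text
```

Because the rules interact, the order of application affects the result on some inputs; the order used here is only one plausible choice.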

TibNorm also expands abbreviations so that they are shown in their full form. For abbreviations in classical Tibetan, TibNorm draws from the list of over 6,000 classical Tibetan abbreviations compiled by Bruno Lainé of the Tibetan Manuscript Project Vienna (TMPV) as part of the project’s Resources for Kanjur and Tanjur Studies. In TibNorm, the user can manually change the flag in the abbreviations table to exclude any abbreviation that they don’t want to expand.
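
The flag mechanism can be pictured with a small sketch. The table layout used here (abbreviation, full form, on/off flag) and the function name `expand_abbreviations` are hypothetical and do not reflect TibNorm's actual file format; the plain substring replacement is also a simplification that a real tool would likely refine with syllable boundaries.

```python
def expand_abbreviations(text, table):
    """Expand each abbreviation whose flag is switched on.

    table is an iterable of (abbreviation, full_form, enabled) rows.
    """
    for abbreviation, full_form, enabled in table:
        if enabled:
            text = text.replace(abbreviation, full_form)
    return text


# One illustrative row: the common abbreviation ཐམད་ for ཐམས་ཅད་.
TABLE = [("ཐམད་", "ཐམས་ཅད་", True)]
```

Setting the third field of a row to `False` leaves that abbreviation untouched in the output.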

TibNorm was developed for the Divergent Discourses project by Yuki Kyogoku of Leipzig University.