Tibetan_tokenizers: botok_tokenizer.py

botok_tokenizer.py is a word-segmentation (tokenization) utility for Tibetan based on BoTok, a tokenizer developed by OpenPecha. It can be pointed either at a single .txt file or at a whole directory of such files. It uses only BoTok's tokenizer component; BoTok's POS-tagging component is not included in this utility.
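As a rough illustration of the behaviour described above, the sketch below accepts either a single .txt file or a directory of them, and segments the text with BoTok's word tokenizer only (no POS tagging). This is a minimal sketch, not the project's actual code: the helper names `collect_txt_files` and `tokenize_file` are ours, and the BoTok calls assume the `WordTokenizer` API from the botok package's documentation.

```python
from pathlib import Path


def collect_txt_files(path):
    """Return the .txt files named by `path`.

    Mirrors the utility's behaviour of accepting either a single
    .txt file or a whole folder of them (helper name is illustrative,
    not taken from botok_tokenizer.py itself).
    """
    p = Path(path)
    if p.is_file() and p.suffix == ".txt":
        return [p]
    if p.is_dir():
        return sorted(p.glob("*.txt"))
    return []


def tokenize_file(path):
    """Segment one Tibetan .txt file into words with BoTok.

    Only BoTok's tokenizer is used here; its POS tagger is deliberately
    left out, matching this utility. API assumed from botok's docs.
    """
    from botok import WordTokenizer  # pip install botok

    wt = WordTokenizer()
    text = Path(path).read_text(encoding="utf-8")
    tokens = wt.tokenize(text)
    # Each BoTok token exposes its segmented surface form as .text.
    return [t.text for t in tokens]
```

Looping `tokenize_file` over the result of `collect_txt_files` would then cover both the single-file and whole-folder cases in one pass.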

Engels, J., Barnett, R., Erhard, F., & Hill, N. (2024). Tibetan_tokenizers: botok_tokenizer.py (v1.1). Zenodo. https://doi.org/10.5281/zenodo.10810709

View code on the Divergent Discourses GitHub

botok_tokenizer.py was developed by James Engels of SOAS University of London for the Divergent Discourses project. The project is a joint study involving SOAS University of London and Leipzig University, funded by the AHRC in the UK and the DFG in Germany. Please acknowledge the project in any use of these materials. Copyright for the project resides with the two universities.