I explore language tokenization using FastAI, Spacy, and Huggingface Tokenizers, with a special focus on the less-represented Balochi language.| mlops.systems
The basics around the tokenisation process: why we do it, the spectrum of choices when you get to choose how to do it, and the family of algorithms most commonly used at the moment.| mlops.systems
I share my journey of building language models for Balochi, a language with few digital resources. I discuss assembling a dataset of 2.6 million Balochi words.| mlops.systems
The dual-edged nature of developing a language model for the Balochi language, weighing potential benefits like improved communication, accessibility, and language preservation against serious risks…| mlops.systems
The Balochi language is underrepresented in NLP.| mlops.systems