Tales from our Community: Empowering NLP in Low-Resource Languages

In this Tales from our Community, we highlight the work of Shamsuddeen Hassan Muhammad, a researcher affiliated with Masakhane NLP, a grassroots organization that is focused on bringing advancements in NLP to the larger African community. Shamsuddeen is also the Co-Founder of HausaNLP, a community working on promoting NLP work in Hausa by developing Hausa language resources.

“With the support of Label Studio, we can break barriers” – Shamsuddeen Muhammad

In the age of AI, it may be hard for speakers of high-resource languages like English, Spanish, or French to imagine a world without the advancements of Natural Language Processing (NLP) and Generative AI. But for Shamsuddeen and his colleagues, that world is still the reality. Most African languages are considered low-resource languages, which can be understood as “less studied, resource scarce, less computerized, less privileged, less commonly taught, or low density, among other denominations (Singh, 2008; Cieri et al., 2016; Tsvetkov, 2017)”. In essence, these languages lack the data and tooling behind the NLP developments of recent years, which prevents the creation of these important systems and keeps their benefits from a huge portion of the global population.

The consequences are concrete. Machine translation systems don’t work as intended for many African languages, sometimes mistranslating words in ways that completely change the meaning or context of a sentence. Many of the automated content moderation systems used on social media don’t work for these languages either, which has allowed posts that encourage hate to spread and cause harm to many groups. Monitoring these posts manually is impossible, so building AI that can handle these tasks and more is crucial.

Shamsuddeen’s work aims to change this by building datasets that can be used to train these models. His main projects focus on developing sentiment analysis, hate speech, and emotion detection datasets in these languages. In 2023, the group published the AfriSenti dataset, which contains over 110,000 tweets in 15 languages. Recently, the group put together AfriHate, a tweet dataset covering 18 languages, and an even larger emotion dataset, which covers 28 languages.

When they started, the team created these labels manually, but this caused new problems for both data quality and collaboration. Shamsuddeen also pointed out that NLP research in Africa faces serious funding shortages, making it hard for teams to access high-quality tools. Because the team relies on donated time, this lack of resources makes efficiency even more crucial.

“I don’t even know without Label Studio how we would do this” – Shamsuddeen Muhammad

As their project grew, the team knew that they needed a better system for annotation that would scale with their needs. The Enterprise version of Label Studio provided many tools that they needed to be successful. Not only can Label Studio handle all of the approximately 30 languages that the group works on, but it can also support the users who work on them. They loved being able to assign reviewers to ensure that their data was of the highest quality. Shamsuddeen shared that Label Studio has all the expected features, working exactly as they should—no glitches, no lag, just a smooth experience.

And while they’ve already made significant strides, Shamsuddeen and his team plan on using Label Studio to continue the invaluable work of bringing the advancements of NLP and Generative AI to the greater African continent.
Interested in having your project featured in a Tales from our Community? Let us know in the Community Slack!