Introducing the Swahili News Dataset for Topic Classification

Photo by Markus Winkler from Pexels

Swahili (also known as Kiswahili) is one of the most widely spoken languages in Africa, with 100–150 million speakers across East Africa. It is widely used as a second language across the African continent and is taught in schools and universities.
In Tanzania, it is one of two national languages (the other is English).


  1. Swahili News
  2. Objective
  3. Implementation
  4. Results
  5. Challenges
  6. Where to Download?
  7. Future Plans

Swahili News

News in Swahili is an important part of the media sphere in Tanzania and other countries in East Africa. News contributes to education, technology, and the economic growth of a country, and news in local languages plays an important cultural role in many African countries.

In the modern age, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.


Objective

Open-source text datasets in Swahili and other African languages are rarely available in Tanzania, which leaves the country behind in creating NLP technologies to solve African challenges.

The goal of this project was to build an open-source text dataset in the Swahili language focused on news articles. I focused mainly on collecting news in different categories: local, international, business or financial, health, sports, and entertainment.

The dataset is open-source, and NLP practitioners can access the dataset and learn from it.


Implementation

I implemented the following phases to achieve the project's objective.

(a) Collect websites with Swahili news
The first phase of the project was to find and collect websites that provide news in the Swahili language. I found some websites that publish news in Swahili only and others that publish in several languages, including Swahili.

(b) Understand policy and copyright
In this phase, I focused on understanding each website's policies and copyright terms, that is, what I can and cannot do with their content. AI4D helped me understand this process by providing Data Protection Guidelines to consider for data collection and data mining.

(c) Understand the structure of the news website
Each news website is built with different web technologies, such as PHP, Python, WordPress, Django, and JavaScript. The main task was to analyze each website's source code using the web browser's view page source tool. I looked at different HTML tags to find news titles, categories, and the links to the content of each title.
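To make this step concrete, here is a minimal sketch of pulling titles, categories, and links out of a page with BeautifulSoup. The HTML snippet, the tag and class names, and the `extract_articles` helper are all made up for illustration; every real news website has its own markup, which is exactly why the page source has to be inspected first.

```python
from bs4 import BeautifulSoup

# Hypothetical listing-page markup. Real sites use their own tags and class names.
SAMPLE_HTML = """
<div class="article">
  <span class="category">michezo</span>
  <h2 class="title"><a href="/habari/1">Simba yashinda mechi</a></h2>
</div>
<div class="article">
  <span class="category">afya</span>
  <h2 class="title"><a href="/habari/2">Chanjo mpya yazinduliwa</a></h2>
</div>
"""


def extract_articles(html):
    """Return one dict per article block with its title, category, and link."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for block in soup.find_all("div", class_="article"):
        # The title link carries both the headline text and the article URL.
        link = block.find("h2", class_="title").find("a")
        category = block.find("span", class_="category")
        articles.append({
            "title": link.get_text(strip=True),
            "category": category.get_text(strip=True),
            "link": link["href"],
        })
    return articles


articles = extract_articles(SAMPLE_HTML)
for article in articles:
    print(article)
```

On a real website, the selectors passed to `find_all` and `find` would be replaced by whatever tags the page source reveals.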

(d) Data Collection
News articles were collected using the following tools:

  • Python programming language
  • Jupyter notebook
  • Python open-source packages (NumPy, pandas, and BeautifulSoup)

The collected news articles were saved in a CSV file containing the content (text) and the category (label) of each article, e.g., sports.
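As a rough illustration of that file layout, the sketch below writes two made-up rows to CSV using only Python's standard `csv` module (pandas would work just as well). The column names `content` and `category` and the `write_articles` helper are assumptions for this example, not necessarily the names used in the released dataset.

```python
import csv
import io

# Made-up rows for illustration. The column names are assumptions.
rows = [
    {"content": "Simba yashinda mechi ya ligi kuu.", "category": "sports"},
    {"content": "Chanjo mpya yazinduliwa, wizara yasema.", "category": "health"},
]


def write_articles(rows, handle):
    """Write article rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["content", "category"])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained. For a real file, use
# open("news.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_articles(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module quotes any text that contains commas, so article content survives a round trip through the file unchanged.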

(e) Analyzing and Cleaning
The collected news articles were analyzed and cleaned to remove irrelevant information, such as HTML tags and symbols that were picked up during the scraping process.
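A cleaning step along these lines could look like the sketch below. It uses only the standard library, and the regular expressions and the `clean_text` helper are assumptions chosen for illustration; the actual cleaning rules applied to the dataset may differ.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like an HTML tag
SYMBOL_RE = re.compile(r"[^\w\s.,!?'-]")   # characters outside a small allow-list
SPACE_RE = re.compile(r"\s+")              # runs of whitespace


def clean_text(raw):
    """Strip HTML tags, decode entities, drop stray symbols, and tidy spaces."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)        # e.g. &nbsp; becomes a non-breaking space
    text = SYMBOL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


cleaned = clean_text("<p>Timu ya <b>Simba</b> imeshinda&nbsp;2-1 ★</p>")
print(cleaned)
```

The allow-list in `SYMBOL_RE` is deliberately small. It would need widening if, for example, currency symbols or quotation marks matter for the downstream task.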


Results

By the end of this project, I had reached the following milestones:

  • Collected and organized around 31,000 news articles.
  • Covered six different categories: local, international, business, health, sports, and entertainment news.


Challenges

The main challenge is the imbalance of collected news across categories. For example, there are few articles in the international, business, and health categories.
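One simple way to see this kind of imbalance is to compute each category's share of the labels. The sketch below does that with `collections.Counter`. The label list is made up and does not reflect the dataset's real counts, and `category_shares` is a hypothetical helper.

```python
from collections import Counter


def category_shares(labels):
    """Return each category's share of the labels as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders categories from largest to smallest.
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}


# Made-up labels for illustration only.
labels = ["local"] * 6 + ["sports"] * 3 + ["health"] * 1
shares = category_shares(labels)
for category, share in shares.items():
    print(f"{category}: {share}%")
```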

Where to Download?

The dataset is available in two versions. The first version (v0.1) was released on December 1, 2020; you can download it from the Zenodo platform here.
Another way is to use the datasets Python library from Hugging Face:

from datasets import load_dataset
dataset = load_dataset("swahili_news")

The second version (v0.2) of the dataset was released on September 18, 2021. This version contains both train and test sets for topic classification. You can download it from the Zenodo platform here.
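The article does not say how the train and test sets were produced. As a sketch under that caveat, one common approach for an imbalanced dataset is a stratified split, which keeps roughly the same category proportions in both sets. The `stratified_split` helper, the 20% test fraction, and the rows below are all assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split rows into train and test sets, category by category."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for items in by_category.values():
        items = items[:]       # copy so the caller's rows are not reordered
        rng.shuffle(items)
        # Keep at least one test example per category when there is more than one row.
        n_test = max(1, round(len(items) * test_fraction)) if len(items) > 1 else 0
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Made-up rows: 40 sports articles and 10 health articles.
rows = [
    {"content": f"habari {i}", "category": "sports" if i % 5 else "health"}
    for i in range(50)
]
train, test = stratified_split(rows)
print(len(train), len(test))
```

Splitting per category means that even a small category such as health still appears in the test set, which a plain random split does not guarantee.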

I plan to make this version available through the datasets Python library as well, for easy access.

Future Plans

The collected news dataset has an imbalanced topic distribution. It contains few articles on the following topics:

  • International news (6.2%)
  • Health news (4.9%)
  • Business news (4.3%)

Therefore, I plan to find more news sources in the Swahili language and collect more articles on the topics mentioned above, in order to bring more balance among the news topics in the dataset.

This will help AI practitioners to create useful machine learning models that perform well in test environments.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.

This article was first published here.




Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing. Contact me to collaborate

Davis David
