How to Write Configuration Files in Your Machine Learning Project.

Manage parameters and initial settings with config files.

Davis David
Analytics Vidhya


When working on a machine learning project, flexibility and reusability are very important and make your life easier while developing the solution. Finding the best way to structure your project files can be difficult when you are a beginner or when the project becomes big. Sometimes you may end up duplicating or rewriting some parts of your project, which is not professional for a Data Scientist or Machine Learning Engineer.

A quick example is running different machine learning experiments to find the best model for the problem you are trying to solve. Most of the time, people change the values of the different parameters directly in the source code and run the experiment again and again, repeating this process until they get the best results. This is not a good approach, because you can lose track of the experiments you have run previously.

Using a configuration file can help you solve this problem and add value to your machine learning project.

After reading this article, you will know:

  • The importance of using a configuration file.
  • An introduction to YAML files.
  • The basic syntax of a YAML file.
  • Rules for creating a YAML file.
  • How to write your first YAML file.
  • How to load a YAML file in Python.
  • How to use a YAML file (as a configuration file) in your machine learning project.

Let’s get started.

So what is a configuration file?

Definition from Wikipedia: “In computing, configuration files (or config files) are files used to configure the parameters and initial settings for some computer programs. They are used for user applications, server processes, and operating system settings.”

Wikipedia's definition highlights two important points: PARAMETERS and INITIAL SETTINGS. These can be specific values that should be applied in your system when it is running. For example, in machine learning you can set batch_size, optimizer, learning rate, test_size, and evaluation metric as part of the configuration file.

In simple terms, a configuration file, often shortened to config file, defines the parameters, options, settings, and preferences applied to systems, infrastructure devices, and applications.

“THE USE OF CONFIGURATION FILES IS ALL ABOUT SETTING HOW YOUR APPLICATION SHOULD RUN.”

This means you can use a configuration file in your machine learning project. Doing so helps you run your project with flexibility and manage your source code easily, e.g. when running different machine learning experiments.

There are different file types you can use for your configuration files, such as YAML, JSON, XML, INI, and Python files. In this article, you will learn more about one of the most popular configuration file formats, YAML, and how to use it in your machine learning project.

YAML Configuration File

“YAML (YAML Ain’t Markup Language) is a human-readable data serialization language. It is commonly used for configuration files but could be used in many applications where data is being stored.” from Wikipedia.

YAML has become a crowd favorite for configuration, presumably because it is easy to read. YAML is also relatively easy to write. Simple YAML files need few formatting characters such as braces and square brackets; most of the relations between items are defined using indentation.

The YAML acronym originally stood for Yet Another Markup Language, but the maintainers renamed it to YAML Ain’t Markup Language to place more emphasis on its data-oriented features.

Basic Syntax of a YAML File

A YAML file has a very simple syntax that is easy for anyone to learn. This is my main reason for choosing YAML files instead of other types of configuration files. The following basic syntax can help you start using YAML for your configuration file:

(a) Comments

In a YAML file, comments begin with a pound sign (#).

Example:

# my first comment

(b) Key-value Pair

Data in YAML is written as key-value pairs, similar to dictionaries or hashes in programming languages such as Python, Perl, and JavaScript.

The key is usually a string, and the value can be any datatype.

Example

learning_rate: 0.1
evaluation_metric: rmse

(c) Numerical Data

YAML recognizes and supports different numerical data types such as integer, floating point, hexadecimal, and octal.

Example.

test_size: 0.2
epochs: 50
scientific_notation: 1.0e+12  # PyYAML needs the decimal point to read this as a number
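To see how these numbers arrive in Python, here is a minimal sketch using the PyYAML package (introduced later in this article). The key names are made up for illustration. One detail worth knowing: in the PyYAML versions I am aware of, scientific notation written without a decimal point is loaded as a string, not a number.

```python
import yaml  # provided by the PyYAML package

# Hypothetical keys, chosen only to illustrate the numeric types.
document = """
test_size: 0.2
epochs: 50
hex_value: 0x1F
with_dot: 1.0e+12
without_dot: 1e+12
"""

values = yaml.safe_load(document)

# Print each value with the Python type it was loaded as.
for key, value in values.items():
    print(f"{key}: {value!r} ({type(value).__name__})")
```

If a value you expected to be a number shows up as str, adding the decimal point (or converting with float() after loading) is a simple workaround.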

(d) String

Writing a string in YAML is very simple: you don’t have to put it in quotes, although you can.

Example.

experiment_title: find the best model by using f1 score

(e) Boolean

YAML indicates true with the keywords True, On, or Yes, and false with False, Off, or No.

Example.

cross_validation: True
save_model: False
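Because unquoted words such as yes or off are read as booleans, quoting matters when you really mean text. The sketch below shows this with PyYAML; the key names are hypothetical.

```python
import yaml  # provided by the PyYAML package

# Hypothetical keys: three boolean spellings and one quoted string.
document = """
cross_validation: True
save_model: off
shuffle: yes
answer_as_text: "yes"
"""

flags = yaml.safe_load(document)

# The first three values load as Python booleans, the quoted one stays a string.
for key, value in flags.items():
    print(f"{key}: {value!r}")
```

So if a setting is meant to hold the literal text "yes" or "no", put it in quotes.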

(f) Array

YAML supports the creation of arrays or lists on a single line.

Example.

ages: [24,76,45,21,45]
labels: ["class_one","class_two","class_three"]
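YAML also lets you write a list with one item per line, each starting with a dash. The sketch below, which assumes PyYAML is installed, shows that both styles load as ordinary Python lists.

```python
import yaml  # provided by the PyYAML package

# The same kind of data written in single-line style and in one-item-per-line style.
document = """
ages: [24, 76, 45, 21, 45]
labels:
  - class_one
  - class_two
  - class_three
"""

data = yaml.safe_load(document)

print(data["ages"])
print(data["labels"])
```

The one-item-per-line style is often easier to read when a list gets long, for example a list of columns to drop.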

Rules for Creating a YAML File

When it comes to creating a YAML file, you have to follow some very important basic rules.

  • The file should have .yaml (or .yml) as the extension.
  • YAML is case sensitive.
  • Do not use tabs for indentation in YAML files; use spaces instead.
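The last two rules are easy to check for yourself. This is a small sketch using PyYAML with made-up keys: it shows that keys differing only in case are treated as different keys, and that a tab used for indentation makes the parser raise an error.

```python
import yaml  # provided by the PyYAML package

# Case sensitivity: "Epochs" and "epochs" are two different keys.
settings = yaml.safe_load("Epochs: 10\nepochs: 50\n")
print(settings)

# Tabs: the second line is indented with a tab character instead of spaces.
bad_document = "model:\n\tname: knn\n"

try:
    yaml.safe_load(bad_document)
    tab_rejected = False
except yaml.YAMLError as error:
    tab_rejected = True
    print(f"Could not parse the document: {error}")
```

Most editors can be set to insert spaces when you press the Tab key, which avoids this error entirely.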

Write Your First YAML File

To create a YAML file, open your favorite text editor, such as Sublime Text, VS Code, or Vim. Then create a new file and save it with a name of your choice, for example my_configuration, and add the .yaml extension at the end. Now you have your first YAML file.

You can start writing different parameters and initial setting values in your my_configuration.yaml file.

Here is a simple example to help you understand what it can look like.
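The sketch below shows what such a file might contain. The keys and values are illustrative assumptions that reuse the syntax examples from earlier in this article. Python is used here only to write the text to disk (in a temporary folder, so nothing is left behind); typing the same text into your editor works just as well.

```python
import tempfile
from pathlib import Path

# Illustrative contents for my_configuration.yaml (all values are assumptions).
CONFIG_TEXT = """\
# settings for one experiment
experiment_title: find the best model by using f1 score
learning_rate: 0.1
epochs: 50
test_size: 0.2
evaluation_metric: rmse
cross_validation: True
save_model: False
labels: ["class_one", "class_two", "class_three"]
"""

with tempfile.TemporaryDirectory() as folder:
    config_path = Path(folder) / "my_configuration.yaml"
    config_path.write_text(CONFIG_TEXT, encoding="utf-8")
    # Read the file back to confirm what was saved.
    written_text = config_path.read_text(encoding="utf-8")

print(written_text)
```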

How to Use a YAML File in a Machine Learning Project

By now you have learned the basic syntax of a YAML file and how to write one. Let’s see how you can use a YAML file as a configuration file in a machine learning project.

Dataset
For this simple machine learning project, I will use the Breast Cancer Wisconsin (Diagnostic) Data Set. The objective of this ML project is to predict whether a person has a benign or malignant tumor.

More information about the dataset can be found here: Breast Cancer Dataset.

The source code for this project (see the GitHub link at the end of this article) shows how you can run this simple machine learning project: loading the dataset, handling missing values, dropping columns, training and testing the model, and finally saving the model. But we didn’t set or use any configuration file to run this project.

You can see a lot of parameters and initial settings that are available in the source code and we can put all of them into a single configuration file.

So which parameters and initial settings can we add to the configuration file?

  • Data directory.
  • Data name.
  • Column(s) to drop.
  • Target variable name.
  • Test size ratio.
  • Parameters of the classifier (KNN).
  • Model name.
  • Models directory.

Now that we have identified the parameters and initial settings, we can write our configuration file and name it my_config.yaml.

The my_config.yaml file contains all the important initial settings and parameters needed to run the K-Nearest Neighbors algorithm in our ML project.
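As a rough sketch of what my_config.yaml could contain, the text below covers each item from the list above. The key names, file names, and values are my own assumptions for illustration, not necessarily those used in the original project. The KNN entries are named after arguments of scikit-learn's KNeighborsClassifier, and grouping them under knn_parameters is just one possible layout. The text is parsed with PyYAML to confirm it is valid YAML.

```python
import yaml  # provided by the PyYAML package

# Illustrative my_config.yaml contents (names and values are assumptions).
MY_CONFIG_TEXT = """\
# initial settings
data_directory: ./data/
data_name: breast-cancer-wisconsin.csv
drop_columns: ["id"]
target_name: diagnosis
test_size: 0.2
model_directory: ./models/
model_name: KNN_classifier.pkl

# parameters of the KNN classifier
knn_parameters:
  n_neighbors: 5
  weights: uniform
  algorithm: auto
  leaf_size: 30
  p: 2
  metric: minkowski
"""

config = yaml.safe_load(MY_CONFIG_TEXT)

print(config["data_name"])
print(config["knn_parameters"])
```

Keeping the classifier parameters in their own group makes it easy to hand all of them to the model in one step later on.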

How to Load the YAML File in Python
In order to load a YAML file in Python, you need to install and use the PyYAML package, which is a YAML parser and emitter for Python. The installation process is fairly straightforward: the easiest way to install PyYAML is via the pip package manager. If you have pip installed on your system, run the following command to download and install it:

pip install pyyaml

To read the YAML file in Python, first import the package with import yaml, then open your YAML file my_config.yaml and load its contents with the safe_load() function from the yaml module.

Now that you know how to load a YAML file in Python, let’s take the configurations we have identified and use them in our machine learning project.
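The loading step can be sketched as a small helper. The name load_config() is the one used in this article, but its exact signature here is an assumption. The usage part writes a tiny file into a temporary folder so the example runs on its own.

```python
import tempfile
from pathlib import Path

import yaml  # provided by the PyYAML package


def load_config(config_path):
    """Read a YAML file and return its contents as Python objects."""
    with open(config_path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)


# Usage: create a small my_config.yaml in a temporary folder and load it.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "my_config.yaml"
    path.write_text("test_size: 0.2\ntarget_name: diagnosis\n", encoding="utf-8")
    config = load_config(path)

print(config)
```

safe_load() only builds plain Python objects such as dictionaries, lists, strings, and numbers, which is all a configuration file needs.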

Our project source code now looks cleaner and more readable. We don't need to change parameters or initial settings directly in the source code, because we have a configuration file for that. We start by importing the required Python packages, including the yaml package, then load the configuration file with the load_config() function and use its initial settings and parameters in our project.

If you want to change the dataset name, columns to drop, test size ratio, or the classifier’s parameters, you can do that in the configuration file. You can also create a new configuration file with the same setting and parameter names but different values, and use it to run your ML experiments.
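To make this concrete, here is a minimal sketch of how the loaded settings can replace hard-coded values. The configuration keys are illustrative assumptions, and the training steps themselves are only indicated in comments so that the example runs with nothing but PyYAML installed.

```python
from pathlib import Path

import yaml  # provided by the PyYAML package

# Illustrative configuration (key names and values are assumptions).
CONFIG_TEXT = """\
data_directory: ./data/
data_name: breast-cancer-wisconsin.csv
drop_columns: ["id"]
target_name: diagnosis
test_size: 0.2
model_directory: ./models/
model_name: KNN_classifier.pkl
knn_parameters:
  n_neighbors: 5
  weights: uniform
"""

config = yaml.safe_load(CONFIG_TEXT)

# Build file paths and keyword arguments from the configuration.
data_path = Path(config["data_directory"]) / config["data_name"]
model_path = Path(config["model_directory"]) / config["model_name"]
knn_kwargs = dict(config["knn_parameters"])

print(f"Dataset to read: {data_path}")
print(f"Model will be saved to: {model_path}")
print(f"Classifier arguments: {knn_kwargs}")

# In the training script these values would be used roughly like this:
#   data = pandas.read_csv(data_path).drop(columns=config["drop_columns"])
#   classifier = KNeighborsClassifier(**knn_kwargs)
#   joblib.dump(classifier, model_path)
```

Nothing in this code needs to change when you try a different test size or number of neighbors; only the configuration does.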
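One way to run several experiments is to keep one configuration file per experiment and loop over them. The sketch below assumes PyYAML is installed; the file names and the run_experiment() function are hypothetical stand-ins for your own files and training code.

```python
import tempfile
from pathlib import Path

import yaml  # provided by the PyYAML package


def run_experiment(config):
    """Hypothetical stand-in for the real training code."""
    return f"test_size={config['test_size']}, n_neighbors={config['n_neighbors']}"


# Two illustrative configuration files with the same keys but different values.
EXPERIMENTS = {
    "experiment_1.yaml": "test_size: 0.2\nn_neighbors: 5\n",
    "experiment_2.yaml": "test_size: 0.3\nn_neighbors: 11\n",
}

results = {}
with tempfile.TemporaryDirectory() as folder:
    for file_name, text in EXPERIMENTS.items():
        (Path(folder) / file_name).write_text(text, encoding="utf-8")

    # Load every configuration file in the folder and run one experiment per file.
    for config_path in sorted(Path(folder).glob("*.yaml")):
        with config_path.open(encoding="utf-8") as handle:
            config = yaml.safe_load(handle)
        results[config_path.name] = run_experiment(config)

for name, summary in results.items():
    print(f"{name}: {summary}")
```

Because each result is stored under the name of its configuration file, you always know which settings produced which outcome.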

Wrap up

Now you understand the importance of using a configuration file in your machine learning project. In this article, you learned what a configuration file is, why it matters in a machine learning project, how to create a YAML file, and how to use it in your ML project. You can start using a configuration file in your next machine learning project.

The dataset and source code for this article are available on GitHub.

If you are interested in learning more about YAML, I recommend reading the online materials from Tutorialspoint.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Feel free to leave a comment too. Till then, see you in the next post! I can also be reached on Twitter @Davis_McDavid


Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing. Contact me to collaborate https://davisdavid.com/