In my previous article, I covered the fundamentals of Machine Learning, the building blocks of this discipline for creating AI models. Today, I want to show you how to build these models using one of the most widely used Machine Learning libraries: scikit-learn.
Scikit-Learn is a free, open-source Python library that lets you create and train various Machine Learning models. It belongs to the broader ecosystem of libraries such as pandas, numpy, and scipy, which are used for data manipulation, computation, statistical and probabilistic analysis, and more. Each of these libraries deserves a longer introduction, but I'll stick to explaining how I use them here. Perhaps pandas, numpy, and scipy will be the topic of a future article...
Scikit-Learn is maintained by the scikit-learn consortium, centered around Inria and many key players in the AI world like Nvidia and Hugging Face. It provides a comprehensive set of tools for building models and measuring their performance. To walk you through them, I'll use a hands-on example where we start with a simple model that we can improve in future articles.
Before we can build a model, we first need to load some data and define the problem we want to solve. For this example, I chose the "Adult Census Income" dataset, a sample drawn from the 1994 US Census. It's a well-known dataset in the Machine Learning world and has been used in published research on the topic. Before we go further, I encourage you to look up what the US Census is and explore the contents of this dataset.
The goal of this dataset is to predict a person's salary bracket based on census data (social class, education level, age, etc.). The dataset only has 2 salary levels: "more than $50K" and "$50K or less."
To load the dataset and work with it, we'll need to use pandas:
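A minimal sketch of that loading step, assuming the dataset has been downloaded locally as adult.csv:

```python
import pandas as pd

# Read the census data from a local CSV file into a DataFrame
# (the file name "adult.csv" is an assumption; adjust it to your own path)
df = pd.read_csv("adult.csv")
```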
Here, we import the pandas library and use the read_csv function to read the data and store it in the variable df. df is short for DataFrame, which is a structure that lets you store and manipulate tabular data. Let's take a look at what it contains:
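For example:

```python
# Display the first 5 rows of the DataFrame
df.head()
```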
The head function displays the first 5 rows of the dataframe. We can see two types of variables:
- Categorical variables (sex, education, native.country, etc.)
- Numerical variables (capital.loss, hours.per.week, age)

Numerical variables are, by their nature, represented by numbers. But not every numeric variable in a dataset is necessarily quantitative. For example, education.num contains a number that encodes the education level, which is a categorical variable: the number simply lets us order the levels relative to each other.
Now that we have our data, we need to choose an algorithm. We first need to determine what type of problem we're solving: classification or regression?
A classification algorithm predicts a categorical variable, sorting data into a finite number of classes. Conversely, a regression algorithm aims to predict a quantitative variable. The output will be a number that isn't necessarily bounded and can be as precise as desired.
Here, we want to predict whether a person's salary is above or below $50K, so we have 2 possible categories. This is therefore a classification problem.
There are many types of classification algorithms, each working differently, but for this first model, we'll train a decision tree.
Decision tree on the Iris dataset
The principle of a decision tree is to build, during training, a tree where each node contains a condition. If the condition is met, we follow one path; if not, we follow the other. At the end of each path through the tree, there's a category that represents the value returned if the data sample followed that path.
Enough with the explanations, let's get to the code! First, we need to separate the data from the target. Due to implementation constraints, scikit-learn's decision trees only support numeric values. So we'll only select columns that contain numbers.
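A sketch of that selection is below; the exact feature column names, and the target column being called income, are assumptions based on the Kaggle version of the dataset:

```python
# Numerical feature columns (names assumed from the Kaggle "Adult Census Income" file)
numerical_columns = ["age", "education.num", "capital.gain",
                     "capital.loss", "hours.per.week"]

data = df[numerical_columns]   # the features
target = df["income"]          # the target (assumed column name)
```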
Thanks to pandas, we just need to put the column names we want inside square brackets on the dataframe to retrieve the data we need. Now that we have our data and our target, we need to split them into a training set and a validation set. For this, we'll use the train_test_split function, which handles this split for us.
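Roughly, with the parameters discussed just below:

```python
from sklearn.model_selection import train_test_split

# Keep 70% of the rows for training and 30% for testing,
# with a fixed seed so the split is reproducible
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.3, random_state=0
)
```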
This function takes at least 2 parameters, the data and the target, and returns the data and targets split into 2 sets. I've used additional parameters to better control the split. test_size lets you define the proportion of test data. Here, I've set it to 30%. random_state lets you choose the seed for the random split. Setting it to 0 ensures that my training and test datasets always contain the same data.
Now let's create our model. For this, we'll use the DecisionTreeClassifier class. This class lets us create a decision tree that can classify the data we feed it.
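Something like this, using the max_depth value described just below:

```python
from sklearn.tree import DecisionTreeClassifier

# Build a tree limited to 3 levels of conditions, then train it on the training set
model = DecisionTreeClassifier(max_depth=3)
model.fit(data_train, target_train)
```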
This code creates the model, and the max_depth parameter controls the maximum depth of the tree: there will be at most 3 levels of conditions. The fit method trains it with our training data. This method is common to all scikit-learn models, which standardizes the model API.
We can then measure the performance of the model we created using the score method.
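For example:

```python
# Accuracy of the trained tree on the held-out test set (around 0.80 here)
model.score(data_test, target_test)
```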
So, we can see that the score is measured using the test set, and our model has an accuracy of 80%. It's a decent model, but there's room for improvement. Note that there are many different scoring metrics that vary depending on the type of problem. This is a vast topic, but if you'd like to start exploring it, I recommend checking out the sklearn.metrics documentation. It will show you what these metrics are and how they work.
If we want to see the generated decision tree, we can use the plot_tree function to display it. We'll need the matplotlib library to render it properly.
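Something along these lines; passing the feature and class names is optional but makes the plot easier to read:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(16, 8))
# Draw the trained tree, colouring each node by its majority class
plot_tree(model, feature_names=numerical_columns,
          class_names=list(model.classes_), filled=True)
plt.show()
```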
Generated decision tree
In this diagram, you can see the conditions and how they chain together. We also get information about how many training samples fall into each node and how they're distributed between the two classes. The gini value is the Gini impurity of each node, which measures the quality of the split: the lower it is, the more cleanly the node separates the 2 targets of our problem.
In the example above, we pre-selected only the numeric columns, but how can we include the others? We'll need to preprocess the data before feeding it to the model. Conveniently, scikit-learn provides a set of classes and functions for selecting and preprocessing data.
First, we'll need to redefine our data. We'll go back to the original dataframe df and this time simply remove the target column.
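A sketch, still assuming the target column is named income:

```python
# drop returns a new DataFrame without the listed columns; df itself is unchanged
data = df.drop(columns=["income"])
target = df["income"]
```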
The drop function removes columns from the dataframe. It doesn't modify the original dataframe but returns a version without the columns specified by the columns parameter. Next, we need to detect the non-numeric columns. For this, we'll use the make_column_selector function.
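For example:

```python
from sklearn.compose import make_column_selector

# Build a selector that picks out the string (object-typed) columns
selector = make_column_selector(dtype_include=object)
```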
This function is a bit unusual: it takes a dtype_include parameter, which specifies the types of columns to look for, and returns a function that takes a dataframe as input. Here, we're telling it to return columns with object type, which corresponds to strings. A list of dtypes (short for data types) is available in the pandas documentation.
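Calling the selector on our dataframe gives us the list of matching column names:

```python
# Apply the selector to retrieve the names of the object-typed columns
cat_columns = selector(data)
```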
We then applied the selector to our dataframe, which extracted the names of columns with object type. Here's what the cat_columns variable contains:
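With the Kaggle version of the dataset, it should look roughly like this:

```python
['workclass', 'education', 'marital.status', 'occupation',
 'relationship', 'race', 'sex', 'native.country']
```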
Now that we've done this, we can move on to the next step: creating a preprocessor. The preprocessor will transform our data before passing it to the model. For this, we'll use 2 classes: OrdinalEncoder, which converts our strings to numbers by assigning each distinct category an integer between 0 and the number of categories minus 1, and ColumnTransformer, which lets us treat the object-type columns differently from the rest.
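A sketch of such a preprocessor, matching the description below:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Encode the string columns as integers and pass the other columns through unchanged
preprocessor = ColumnTransformer(
    [("ordinal-encoder",
      OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
      cat_columns)],
    remainder="passthrough",
)
```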
Let's break down this preprocessor: the ColumnTransformer takes a list of tuples as input. Each tuple has 3 elements: the step name (useful when visualizing the preprocessor), the transformation to apply, and the list of columns affected by this preprocessor. The remainder parameter specifies what to do with columns that haven't been transformed. Here, we leave them as they are, though we could have dropped them.
For the OrdinalEncoder, we're telling it that if it encounters unknown values, it should assign them the value -1. This is necessary because the train/test split may have sent rare categories entirely into the test set, so the encoder can come across values at prediction time that it never saw during training.
Now that we have our preprocessor, we can use it and connect it to our model with the make_pipeline function. This function lets us chain steps through which our data will pass and treat the whole thing as a single model. We can train it and measure its performance.
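Here is a sketch of that pipeline; reusing max_depth=3 for the tree is an assumption:

```python
from sklearn.pipeline import make_pipeline

# Split the full feature set (numeric and categorical columns) as before
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.3, random_state=0
)

# Chain the preprocessor and the decision tree into a single model
model = make_pipeline(preprocessor, DecisionTreeClassifier(max_depth=3))
model.fit(data_train, target_train)
```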
We split our data again and create the pipeline, with the preprocessor first and then the model. Then we train the model with the fit method, as before. Now let's look at our model's performance:
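As before:

```python
# Accuracy of the full pipeline on the test set (close to 0.84 here)
model.score(data_test, target_test)
```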
The model reached an accuracy of nearly 84%, a 4-point improvement over the version without non-numeric data.
We've seen how to create our first model, a decision tree in our case, and how to measure its performance. We've seen how to create a simple pipeline with a data preparation step. But we've left many things aside:
These are all questions that remain to be addressed, and scikit-learn provides tools to solve these problems, which we'll be able to explore in future articles.
A pillar of Lamalo, Yohann combines technical expertise with a talent for teaching. An architect at heart and a gifted developer, he brings his energy and skills to the scale-up Lamalo. A natural educator, he never hesitates to share his knowledge.