This article is part of the scikit-learn series. It follows the article on cross-validation, which I recommend reading first before diving into this one. We will build on the concepts and code covered in the cross-validation article to go deeper.
In previous articles, we saw how to optimize the training of a model -- a random forest -- for a specific dataset. But if we take a closer look at the scikit-learn documentation for random forests, we can see that several parameters could have been adjusted. What would happen if we changed the number of trees -- would it affect our predictions? What if the trees were more complex? What if each tree only saw a fraction of the data? To answer all these questions, we would need to train random forest models with different configurations and compare their respective performance.
Studying these model parameters (called "hyperparameters") is what we call a hyperparameter search. The goal is to find an optimal model configuration for the problem we want to solve by finding a combination of hyperparameters that performs better than the others. Note that this configuration may be a local optimum, because the sheer number of possible combinations might make testing all of them impractical.
How do we carry out such a search? Once the model algorithm is chosen, there are 3 essential steps to follow.
First, we need to choose which hyperparameters we want to optimize and what values they can take. We need to limit the search space so that the search can actually finish, but also to avoid wasting time on impractical combinations that won't be usable afterward.
Second, we need to split the dataset into 2 parts: the training set and the validation set. In a hyperparameter search, we use cross-validation to reduce training biases related to overfitting, but to compare the candidate models objectively, the final comparison must be done on data they have never seen. The validation set serves precisely this purpose. The training set is therefore used both for training each model and for testing it within each cross-validation round.
Third, once all models are trained and tested on the validation set, we need to compare the results obtained and choose the best option based on the metric of our choice.
Our search is now complete -- we have our best model. We can either use this model or further refine the search.
Before getting into practice with scikit-learn, let's pick up the model and dataset from the previous article: the random forest with the Adult Census Income dataset. This will help us stay familiar with the data and keep the same goal: predicting whether individuals earn more than $50k or not.
Here is the code:
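A minimal sketch of that setup, assuming the data is available as a CSV file and that the categorical columns are ordinal-encoded before the random forest (the file name and preprocessing choices below are assumptions), might look like this:

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Load the Adult Census Income data (the file name is illustrative)
# and separate the target column from the features.
adult_census = pd.read_csv("adult-census.csv")
target = adult_census["class"]
data = adult_census.drop(columns=["class"])

# Encode the categorical columns and pass the numerical ones through.
preprocessor = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        make_column_selector(dtype_include=object),
    ),
    remainder="passthrough",
)

# The full model: preprocessing followed by a random forest.
model = make_pipeline(preprocessor, RandomForestClassifier(random_state=0))
```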
For an explanation of this code, refer to the previous articles on scikit-learn. Now that we have our model, how do we run this search? We follow our plan! First step: let's choose the model hyperparameters we want to tune.
To find out what hyperparameters our model has, and more importantly how they're named, scikit-learn provides the get_params method. This method returns all the parameters we can modify along with their current values.
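As a quick sketch with the pipeline built above:

```python
# Print every tunable parameter of the pipeline and its current value.
for name, value in model.get_params().items():
    print(f"{name}: {value}")
```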
The output is long, because it covers every step of the pipeline.
As you can see, our model has many hyperparameters. There are 2 reasons for this: first, random forests have several aspects we can tune. Second, our model is a pipeline composed of multiple parts: it transforms the data before making its prediction. Naturally, we can also tune the parameters of this data transformation, but that's not the focus here.
The names of these hyperparameters come from the classes that make up the model, as well as from the names we assign to them when we create the pipeline and its components. This isn't the topic of this article, but the scikit-learn documentation on pipelines and composite estimators explains this part.
For this article, we'll choose 3 parameters that we want to combine and optimize:
- randomforestclassifier__n_estimators: the number of decision trees in our random forest. This is an integer value.
- randomforestclassifier__max_depth: the maximum depth of each tree. The deeper a tree is, the more it adapts to the details of the training set, making it more accurate there but also more prone to overfitting. This is an integer value.
- randomforestclassifier__max_samples: the proportion of data each tree can see at most. This value ranges between 0 and 1 in our example.

There are obviously others -- I'll refer you to the scikit-learn documentation on RandomForestClassifier if you want to deepen your search.
Now, we just need to find the values we want to test and run the search. But how do we go about it? A simple approach would be to use for loops over the values we want to test. But scikit-learn provides more powerful tools, and that's what we'll use.
The first type of search we'll use is the grid search. The principle of grid search is to test all possible combinations of values among those we provide. For example, in our case, it will test the 3 parameters with the different values we specify. So, if we provide 5 different values per parameter, it will create 5 * 5 * 5 = 125 different combinations.
Grid search is therefore exhaustive, and its combinatorial complexity grows very fast. On the flip side, no option is left untested. So how do we do it with scikit-learn? As usual, there's a class for that: sklearn.model_selection.GridSearchCV. To use it, we'll first define a dictionary containing the parameters and the different values we want to test.
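The values below are only an illustration of what such a grid can look like -- 5 values per hyperparameter, hence 125 combinations:

```python
# Grid of hyperparameter values to test exhaustively.
param_grid = {
    "randomforestclassifier__n_estimators": [1, 10, 50, 100, 500],
    "randomforestclassifier__max_depth": [1, 5, 10, 20, 30],
    "randomforestclassifier__max_samples": [0.1, 0.3, 0.5, 0.8, 1.0],
}
```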
As you can see, the keys of this dictionary are the parameter names that the get_params() method gave us, and the values are the lists of values we want to test. Since our goal here is to discover how these parameters influence the results, the range of some values is intentionally wide.
Then we'll use GridSearchCV. It's straightforward -- we pass it our model, the parameter grid, the metric we want to use, and the number of cross-validation folds, and that's it!
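A sketch of that setup; the scoring metric and the number of folds below are example choices:

```python
from sklearn.model_selection import GridSearchCV

# Evaluate every combination of param_grid with 5-fold cross-validation,
# comparing the models on their accuracy.
grid_search = GridSearchCV(
    model,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
```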
And yes, hyperparameter search also uses cross-validation -- this helps make the results more reliable. The cv parameter works just like the one in the cross_validate function we covered in the cross-validation article.
Once the object is created, we use the fit method to pass the data and target. After training is complete -- which can take varying amounts of time depending on your machine -- we can find the best parameters discovered. We need to access the best_params_ property:
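For instance, reusing the data and target from the setup sketch above:

```python
# Launch the search: one model is trained per combination and per fold.
grid_search.fit(data, target)

# Best combination of hyperparameters found by the search.
print(grid_search.best_params_)
```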
We can see here that the best-performing model is the one with few but very deep trees, where each tree sees only a small portion of the data. If we want detailed results, we can look at cv_results_: it contains all the information about the test scores on each split, the tested parameters and their values, and so on.
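For example, cv_results_ can be inspected as a pandas DataFrame:

```python
import pandas as pd

# Detailed results of the search, ranked by mean test score.
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results.sort_values("rank_test_score").head())
```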
Finally, if we want to retrieve the best model, we can find it in the best_estimator_ property.
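For example:

```python
# The best pipeline, refitted on the whole data passed to fit,
# ready to make predictions.
best_model = grid_search.best_estimator_
print(best_model.predict(data.head()))
```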
Random search is a type of search where, like grid search, we limit the parameters and their values. However, where grid search tests all combinations, random search only selects a predetermined number of them. This means we can include many more values for greater diversity without increasing the number of models to test.
To perform this search with scikit-learn, we'll use the sklearn.model_selection.RandomizedSearchCV class. Like its sibling for grid search, we need to define a parameter distribution. But additionally, we must determine in advance the number of models we want to test.
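Here is a sketch of such a distribution; the ranges below, and which one goes with which hyperparameter, are illustrative:

```python
import numpy as np
from scipy.stats import uniform

# np.arange(1, 11) gives the integers 1 to 10; uniform.rvs(size=1000)
# draws 1000 random values between 0 and 1 from a uniform distribution.
param_distributions = {
    "randomforestclassifier__n_estimators": np.arange(1, 101),
    "randomforestclassifier__max_depth": np.arange(1, 11),
    "randomforestclassifier__max_samples": uniform.rvs(size=1000),
}
```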
Here, we can use functions that generate a large quantity of numbers, such as np.arange from numpy, which gives us values from 1 to 10, or the uniform.rvs function from scipy, which gives us 1000 random values between 0 and 1 following a uniform distribution.
We could have used simple lists, but numpy and scipy are interesting libraries and I wanted to show you how they can be used.
Then, just like with grid search, we pass the model, the parameter distribution, a metric, and the number of validations, and the model is ready to train.
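As with the grid search, the scoring metric and the number of folds are example choices:

```python
from sklearn.model_selection import RandomizedSearchCV

# Sample 100 random combinations from param_distributions and evaluate
# each one with 5-fold cross-validation.
random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=100,
    scoring="accuracy",
    cv=5,
)
random_search.fit(data, target)
```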
The n_iter parameter defines the number of models we want to test. Here, we're testing 100, which took almost 4 minutes on my machine. You can lower this if you wish. And in the same way, we can view the best parameters using the best_params_ property.
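For instance:

```python
# Best combination of hyperparameters found by the random search.
print(random_search.best_params_)
```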
Obviously, your results may differ from mine due to the random nature of the search. We can also see that the result obtained is very close to the one from the grid search.
Similarly, you can view detailed results in the cv_results_ property and the best model in best_estimator_.
You now know how to perform a hyperparameter search and find the best parameters for a given model! Scikit-learn offers yet more search methods based on different heuristics; you can explore the documentation for those.
One next step we could consider would be testing different types of models for a given problem. But you now have all the tools needed to carry out such a project with scikit-learn.