Also referred to as model tuning, hyperparameter optimization is the experimental process of searching for the combination of hyperparameters for a given dataset and machine learning algorithm that delivers the best model performance, as measured on a validation set.
A hyperparameter is a configuration variable that determines the architecture of the model and the learning process, e.g., the number of trees in a tree-based model, the learning rate, etc. It has to be specified manually by the machine learning engineer / data scientist before training and does not change during a training run.
Contrast a hyperparameter with a parameter, which is a variable internal to the model whose value is learned from the dataset during training, e.g., coefficients in a linear regression model, node weights in a neural network, etc. Parameters are often saved as part of the final model.
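A minimal sketch of this distinction, assuming scikit-learn and a synthetic dataset (the model choice and the C value are illustrative):

```python
# Hyperparameter vs. parameter, assuming scikit-learn and synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Hyperparameters: chosen by the practitioner *before* training
# and held fixed for the whole training run.
model = LogisticRegression(C=0.1, penalty="l2")

# Parameters: learned *from the data* during training.
model.fit(X, y)
print("Learned coefficients:", model.coef_)    # model parameters
print("Learned intercept:", model.intercept_)  # also a parameter
```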
Important Hyperparameters for Supervised Learning Algorithms #
| Algorithm | Hyperparameters |
| --- | --- |
| Regression / logistic regression | Regularization parameter – Lasso (L1) or Ridge (L2) |
| Decision tree | Minimum size of leaves; maximum size of leaves; maximum tree depth |
| k-nearest neighbors | Number of neighbors (k) |
| Support vector machines | Kernel type (dot, radial, neural, etc.); kernel parameters (gamma, sigma, degree, etc.); penalty (C) – higher values give harder margins and vice versa |
| Artificial neural networks | Number of neurons in each layer; number of hidden layers; number of training iterations (epochs); learning rate; initial weights |
| Random forest | All decision tree hyperparameters; number of trees; number of features to select at each split |
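The sketch below shows how some of the hyperparameters in the table are typically exposed, assuming the scikit-learn implementations of these algorithms; other libraries use different argument names for the same concepts, and the values shown are only placeholders.

```python
# How the hyperparameters above appear as constructor arguments, assuming scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=5,    # minimum number of samples in a leaf
    max_leaf_nodes=50,     # maximum number of leaf nodes
    max_depth=10,          # maximum tree depth
)
knn = KNeighborsClassifier(n_neighbors=7)   # number of neighbors (k)
svm = SVC(kernel="rbf", gamma=0.1, C=1.0)   # kernel type, kernel parameter, penalty
mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # neurons per hidden layer / number of hidden layers
    max_iter=200,                  # training iterations (epochs)
    learning_rate_init=0.001,      # learning rate
)
forest = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features="sqrt",   # features considered at each split
    max_depth=10,          # plus any decision tree hyperparameter
)
```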
Hyperparameter Optimization Methods #
Algorithms can have many hyperparameters, and finding the optimal combination can be treated as a search problem.
Manual tuning is a trial-and-error method. With experience, it is possible to “guess” hyperparameter values that deliver very good performance.
Grid search (or parameter sweeping) uses brute force to test every combination in a specified subset of the hyperparameter space, measures performance (typically with cross-validation), and picks the combination that performs best. It is computationally expensive.
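A minimal grid search sketch, assuming scikit-learn's GridSearchCV, a random forest, and an illustrative parameter grid:

```python
# Grid search sketch, assuming scikit-learn's GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Every combination is tried: 3 x 3 x 2 = 18 candidates, each cross-validated.
param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```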
Random search randomly samples hyperparameter combinations from specified statistical distributions and evaluates each one.
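A comparable random search sketch, assuming scikit-learn's RandomizedSearchCV with distributions from scipy.stats (the ranges and number of iterations are illustrative):

```python
# Random search sketch, assuming scikit-learn's RandomizedSearchCV;
# values are drawn from distributions rather than swept exhaustively.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_distributions = {
    "n_estimators": randint(50, 500),    # discrete uniform distribution
    "max_depth": randint(3, 20),
    "max_features": uniform(0.1, 0.9),   # fraction of features, sampled uniformly
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=25,            # number of sampled combinations, not an exhaustive sweep
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```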
Bayesian optimization uses the performance of past trials to choose the hyperparameter values to evaluate next. It is typically less computationally expensive than grid search and requires less manual effort from the data scientist / machine learning engineer in deciding which values to try.
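One way to sketch this, assuming the Optuna library, whose default TPE sampler uses the results of past trials to propose new candidates (the search ranges and trial count are illustrative):

```python
# Bayesian-style optimization sketch, assuming the Optuna library.
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

def objective(trial):
    # Each trial's hyperparameters are suggested based on earlier trials' scores.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
        "max_features": trial.suggest_float("max_features", 0.1, 1.0),
    }
    model = RandomForestClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```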
Other methods: gradient-based optimization, evolutionary optimization, population-based optimization, etc.