The goal of predictive modeling, a.k.a. supervised learning, is to find a model that generalizes well, i.e., makes correct predictions for queries on new, never-before-seen data. The process involves giving an algorithm a portion of the data, called the “train set”, to uncover, or “learn”, the relationships between the descriptive features and the target variable, and then testing the resulting model on the remaining portion of the data, called the “test set”, to evaluate how well the model has “learned”. This train/test step is repeated for several candidate algorithms, and the model that performs best on the test set is (generally) selected.
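The workflow above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular library's API: the data, the 70/30 split ratio, and the single candidate “algorithm” (a least-squares slope through the origin) are all illustrative assumptions.

```python
import random

# Toy dataset: y = 2*x + noise. The "true" relationship is linear,
# so a good model should recover a slope near 2.
random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)

# Split the data: 70% train set, 30% held-out test set.
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

# "Learning": fit a least-squares slope through the origin on the train set only.
slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)

# "Testing": evaluate on the never-before-seen test set via mean squared error.
mse = sum((y - slope * x) ** 2 for x, y in test) / len(test)
```

In practice each candidate algorithm would be fit on the same train set, scored on the same test set, and the best scorer selected; the sketch shows the split-fit-evaluate loop for a single candidate.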
Contrast this with statistical inference, where the goal is to find the model that best characterizes the relationship between the input variables and the outcome variable, e.g., the line that minimizes the sum of squared errors across all points in a linear regression model. The process uses the whole dataset to create the model, i.e., there is no train/test split. While such a model can be used to make predictions, it is unlikely to make good predictions on new, never-before-seen data. (Stewart, 2019)
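The inference-style workflow can be sketched the same way: one model, fit to all the data, with no held-out portion. The synthetic dataset below (an exactly linear relationship) is an illustrative assumption; the fit is ordinary least squares, which minimizes the sum of squared errors over every point.

```python
# All the data participates in the fit; nothing is held out for testing.
xs = list(range(20))
ys = [3.0 * x + 1.0 for x in xs]  # exact relationship: slope 3, intercept 1

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Ordinary least squares over the whole dataset:
# minimizes the sum of squared errors across all points.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
```

Because every point was used to characterize the relationship, there is no held-out data left on which to estimate how the model would fare on unseen queries.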
Predictive modeling comprises regression tasks, where the target is a continuous numeric value, and classification tasks, where the target is a categorical label.