Data leakage occurs when a model is trained on information that will not be available when the model is deployed to make predictions on new data. Leakage produces overly optimistic performance during modeling (e.g., 97% accuracy) but very poor generalization in production. The two types of leakage are outlined below.
Target leakage occurs when the training data contains information that will not be available at prediction time. For example, including call duration in a model that predicts whether a cold-call prospect will buy a service is wrong, because the call duration is only known after the call has been made. Carefully considering the timing, or chronological order, in which each feature becomes available reduces the chance of target leakage.
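The cold-call example above can be sketched as a feature-selection step. This is a minimal illustration with pandas; the dataset, column names, and values are all hypothetical, invented only to show the leaky feature being dropped.

```python
import pandas as pd

# Hypothetical cold-call dataset (all columns and values are illustrative).
calls = pd.DataFrame({
    "prospect_age":   [34, 51, 29, 45],
    "prior_customer": [0, 1, 0, 1],
    "call_duration":  [42, 310, 15, 280],  # only known AFTER the call ends
    "bought_service": [0, 1, 0, 1],        # prediction target
})

# call_duration will not exist at prediction time (the call has not
# happened yet), so it must be dropped before training to avoid
# target leakage.
LEAKY_FEATURES = ["call_duration"]
X = calls.drop(columns=LEAKY_FEATURES + ["bought_service"])
y = calls["bought_service"]
```

The general habit is the same for any dataset: for each candidate feature, ask "will this value exist at the moment the model is asked to predict?" and drop it if the answer is no.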
Train-test contamination occurs when model-based preprocessing or feature engineering (e.g., fitting a scaler or an imputer) is performed before splitting the dataset into a train set and a test set, so that statistics computed from the test set leak into training and corrupt the evaluation. It can be avoided by splitting first, fitting every preprocessing step on the train set only, and then applying the same fitted transformations unchanged to the test set.
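One way to enforce "split first, fit on train only" is a pipeline. Below is a minimal sketch using scikit-learn (an assumption; the text names no library): the scaler inside the pipeline learns its mean and variance from the train set alone, and the same learned parameters are then applied to the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Split BEFORE any preprocessing is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# fit() runs only on the train set; the scaler's statistics never
# see the test rows, so the test score is an honest estimate.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
```

Fitting the scaler on the full dataset before the split would leak test-set statistics into training; wrapping preprocessing and model in one pipeline makes that mistake structurally impossible.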