Data Leakage (or Leakage)

Data leakage occurs when a model is trained using information that will not be available when the model is deployed to make predictions on new data. Leakage produces overly optimistic performance during model development (e.g., 97% validation accuracy) but poor generalization in production. The two types of leakage are outlined below.

Target leakage occurs when the training data includes features whose values are only determined after the target outcome is known. For example, including call duration in a model that predicts whether a cold-call prospect will buy a service is wrong, because call duration is only known after the call has been made, i.e., after the outcome being predicted. Carefully considering the timing, or chronological order, in which each feature becomes available reduces the chance of target leakage.
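
Below is a minimal sketch of target leakage on entirely synthetic data. The column names (prospect_age, call_duration, bought) and the data-generating process are hypothetical, invented to mimic the cold-call example above: call_duration is generated from the target itself, standing in for a feature that only exists after the outcome.

```python
# Synthetic illustration of target leakage (all names and data are made up).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
bought = rng.integers(0, 2, n)                # target: did the prospect buy?
prospect_age = rng.normal(40, 10, n)          # legitimate pre-call feature
# Known only after the call, and driven by the outcome itself:
call_duration = bought * rng.normal(10, 2, n) + rng.normal(3, 1, n)

X_leaky = np.column_stack([prospect_age, call_duration])
X_clean = np.column_stack([prospect_age])

model = RandomForestClassifier(random_state=0)
print("with leaky feature:", cross_val_score(model, X_leaky, bought, cv=5).mean())
print("without it:        ", cross_val_score(model, X_clean, bought, cv=5).mean())
# The leaky model scores near-perfectly in validation, but call_duration
# is unknown at prediction time, so that score says nothing about production.
```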

Train-test contamination occurs when model-based preprocessing and/or feature engineering (e.g., scaling or imputation) is fitted on the full dataset before it is split into a train set and a test set, so statistics from the test set leak into training. It can be avoided by splitting first, fitting all preprocessing on the train set only, and then applying the same fitted transformations to the test set.
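
Here is a minimal sketch of the split-then-fit pattern using scikit-learn; the data is synthetic and purely illustrative. A Pipeline enforces the correct order automatically: the scaler is fitted on the train fold and its train-set statistics are reused to transform the test fold.

```python
# Avoiding train-test contamination: split first, fit preprocessing on train only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Wrong: fitting the scaler on all of X lets test-set statistics
# influence the transformation applied to the training data.
# X_scaled = StandardScaler().fit_transform(X)   # contaminated

# Right: split first; the pipeline fits the scaler on X_train only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))
```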
