Data Leakage (or Leakage)

Data leakage occurs when a model is trained using information that will not be available when the model is deployed to make predictions on new data. Leakage produces overly optimistic performance during model development (e.g., 97% validation accuracy) but poor generalization in production. The two types of leakage are outlined below.

Target leakage occurs when the training data includes features whose values are only determined after the target outcome is known. For example, including call duration in a model that predicts whether a cold-call prospect will buy a service is wrong, because call duration is only known after the call has been made, i.e., after the outcome being predicted. Carefully considering the timing, or chronological order, in which each feature becomes available reduces the chance of target leakage.
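
Below is a minimal sketch of target leakage on entirely synthetic data. The column names (prospect_age, call_duration, bought) and the data-generating process are hypothetical, invented to mimic the cold-call example above: call_duration is generated from the target itself, standing in for a feature that only exists after the outcome.

```python
# Synthetic illustration of target leakage (all names and data are made up).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
bought = rng.integers(0, 2, n)                # target: did the prospect buy?
prospect_age = rng.normal(40, 10, n)          # legitimate pre-call feature
# Known only after the call, and driven by the outcome itself:
call_duration = bought * rng.normal(10, 2, n) + rng.normal(3, 1, n)

X_leaky = np.column_stack([prospect_age, call_duration])
X_clean = np.column_stack([prospect_age])

model = RandomForestClassifier(random_state=0)
print("with leaky feature:", cross_val_score(model, X_leaky, bought, cv=5).mean())
print("without it:        ", cross_val_score(model, X_clean, bought, cv=5).mean())
# The leaky model scores near-perfectly in validation, but call_duration
# is unknown at prediction time, so that score says nothing about production.
```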

Train-test contamination occurs when model-based preprocessing and/or feature engineering (e.g., scaling or imputation) is fitted on the full dataset before it is split into a train set and a test set, so statistics from the test set leak into training. It can be avoided by splitting first, fitting all preprocessing on the train set only, and then applying the same fitted transformations to the test set.
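
Here is a minimal sketch of the split-then-fit pattern using scikit-learn; the data is synthetic and purely illustrative. A Pipeline enforces the correct order automatically: the scaler is fitted on the train fold and its train-set statistics are reused to transform the test fold.

```python
# Avoiding train-test contamination: split first, fit preprocessing on train only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Wrong: fitting the scaler on all of X lets test-set statistics
# influence the transformation applied to the training data.
# X_scaled = StandardScaler().fit_transform(X)   # contaminated

# Right: split first; the pipeline fits the scaler on X_train only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))
```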
