Data cleaning refers to activities to identify and repair errors in a dataset. Its objective is to increase data accuracy, quality and value. Ways of handling common issues with data are outlined below; the actual course of action depends on the use case and the issue.
Missing values
– Replace missing value with the mode (for categorical variable type) or with the median (for numerical variable type) of that feature,
– Use a model to estimate the missing values (tends to give more accurate values, but is more time intensive),
– Remove examples/records with missing values (not for time series data), or
– Do nothing and let algorithms recognize them as just another category
Duplicates
– Remove duplicate examples
Outliers or anomalies
– Correct or remove values if the outlier/anomaly is due to error,
– Remove the complete example/record,
– Keep the example or value if the application requires the outlier values (eg, in fraud detection), or
– Apply clipping to cap all feature values above (or below) a certain value to a fixed value (eg, clip all height values below 145 to be exactly 145)
Date and time
– Convert all timestamps to the format required by the machine learning software or platform, eg, ddmmyyyy:hhmmss
Structural errors
– Fix typos, inconsistent capitalization (eg, Price and price) or wrong/inconsistent values (T-shirt and T Shirt) that arise during data capturing
Sparse features (with too many zeros):
– Remove the features,
– Use dimensionality reduction techniques to make them dense, or
– Use algorithms that are robust to sparsity during modeling
String formatting and non-alphanumeric characters:
– Remove line breaks, symbols, white spaces, etc at the beginning and at the end of values
Data types:
– Use the correct data types to help save memory usage as well enable correct operations such as arithmetic to be performed
Incomplete information (e.g., address does not include name of city/town when the latter is expected):
– Find and add the missing information, or
– Remove the record/example
Invalid data (e.g., values for height with a negative number):
– Find and add the correct value, or
– Remove the record/example
Columns with very few values:
– Remove such columns