Sourcing and preparing data for modeling is one of the most difficult and tedious steps in machine learning. The process is iterative and may require several passes to get the best results. The actual steps and techniques depend on the use case, the dataset, and the chosen algorithms, so domain expertise is key.
Proper data preparation often delivers more value than clever algorithms.
Goal
The goal of this phase of the machine learning process is to transform data into a format suitable for machine learning algorithms, so that the resulting models can perform as well as possible.
Tasks
- Data blending: combine data from multiple sources into a single dataset
- Data cleaning: find and correct mistakes or errors in the data
- Dimensionality reduction: reduce the number of features under consideration
- Data transformation: convert data to meet the requirements of specific tools or algorithms
- Feature engineering: construct additional features from the available data
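The tasks above can be sketched on a toy dataset. The following is a minimal illustration using only the Python standard library, not a recommended pipeline: the records, column names, and helper functions (`blend`, `clean`, `engineer`, `drop_columns`, `standardize`) are all hypothetical, and dropping a redundant column stands in for dimensionality reduction, which in practice usually relies on feature selection or a projection method such as PCA.

```python
from statistics import mean, pstdev

# Hypothetical toy data: two sources that share a customer_id key.
customers = [
    {"customer_id": 1, "age": 34, "country": "DE"},
    {"customer_id": 2, "age": None, "country": "de"},
    {"customer_id": 3, "age": 51, "country": "FR"},
]
orders = [
    {"customer_id": 1, "n_orders": 4, "total_spent": 200.0},
    {"customer_id": 2, "n_orders": 1, "total_spent": 30.0},
    {"customer_id": 3, "n_orders": 5, "total_spent": 400.0},
]


def blend(left, right, key):
    """Data blending: inner-join two lists of records on a shared key."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


def clean(rows):
    """Data cleaning: normalize country codes, fill missing ages with the mean."""
    fill = mean(r["age"] for r in rows if r["age"] is not None)
    return [
        {
            **r,
            "country": r["country"].upper(),
            "age": fill if r["age"] is None else r["age"],
        }
        for r in rows
    ]


def engineer(rows):
    """Feature engineering: derive the average value of an order."""
    return [{**r, "avg_order_value": r["total_spent"] / r["n_orders"]} for r in rows]


def drop_columns(rows, columns):
    """Dimensionality reduction (simplest form): drop redundant features."""
    return [{k: v for k, v in r.items() if k not in columns} for r in rows]


def standardize(rows, column):
    """Data transformation: rescale a column to zero mean and unit variance."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    return [
        {**r, column: (r[column] - mu) / sigma if sigma else 0.0}
        for r in rows
    ]


prepared = blend(customers, orders, "customer_id")
prepared = clean(prepared)
prepared = engineer(prepared)
# total_spent is now implied by n_orders and avg_order_value, so it is dropped.
prepared = drop_columns(prepared, {"total_spent"})
prepared = standardize(prepared, "age")

for row in prepared:
    print(row)
```

The order of the steps here is one plausible choice. Because the process is iterative, a real project would typically revisit and reorder these steps as the data and the model requirements become clearer.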
Roles
- Domain Expert
- Business User
- Data Scientist
- Data Engineer
- Database Administrator