Goal
Collect, and become familiar with, the data required to build the selected model.
This phase overlaps with the Business Understanding and Data Preparation phases.
Tasks
- Collect initial data and build experimental datasets
- Identify data quality issues such as missing values, garbage records, and duplicates
- Determine the suitability of the data for the project in terms of, for example, number of records, distribution of features, and quality of labels
- Discover first insights into the data
- Find interesting subsets to form hypotheses regarding hidden information
- Generate ideas about data preparation (the next phase of the project lifecycle)
These tasks require a good understanding of the data and of techniques in exploratory data analysis (EDA); both are discussed in the remainder of this section.
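As a rough illustration of the quality and suitability checks listed above, the following sketch profiles a small in-memory dataset using only the Python standard library. The sample records, the field names (`id`, `age`, `label`), and the helpers `is_missing` and `profile` are all hypothetical and chosen for illustration; a real project would more likely load its data from a file or database and might use a dataframe library instead.

```python
from collections import Counter

# Hypothetical sample records; None or an empty string marks a missing value.
RECORDS = [
    {"id": 1, "age": 34, "label": "yes"},
    {"id": 2, "age": None, "label": "no"},
    {"id": 3, "age": 51, "label": "no"},
    {"id": 3, "age": 51, "label": "no"},  # exact duplicate of the previous record
    {"id": 4, "age": 29, "label": ""},
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile(records, label_field):
    """Return basic quality and suitability statistics for a list of dicts."""
    # Union of all field names, so records lacking a field count as missing it.
    fields = sorted({key for record in records for key in record})

    # Number of missing values per field.
    missing = {
        field: sum(is_missing(record.get(field)) for record in records)
        for field in fields
    }

    # Records that exactly repeat an earlier record (all fields equal).
    seen = Counter(tuple(record.get(f) for f in fields) for record in records)
    duplicates = sum(count - 1 for count in seen.values())

    # Label distribution, ignoring records whose label is missing.
    label_counts = Counter(
        record.get(label_field)
        for record in records
        if not is_missing(record.get(label_field))
    )

    return {
        "n_records": len(records),
        "missing": missing,
        "duplicates": duplicates,
        "label_counts": dict(label_counts),
    }


report = profile(RECORDS, "label")
print(report)
```

A summary like this is only a starting point: it can flag where to look (a field with many missing values, a skewed label distribution), but deciding whether the data is fit for the project still depends on input from the domain expert and the project's requirements.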
Roles
- Project Sponsor
- Domain Expert
- Business User
- Data Engineer
- Database Administrator
- Data Scientist/Machine Learning Engineer
- Data Developer/Software Engineer