• Data is scattered across a variety of sources: the operational databases of business applications (e.g., CRM, ERP and SCM databases), data warehouses, flat files, spreadsheets, online feeds, etc.
  • Internal data is the basis of most machine learning projects. It is generated in the normal course of business, e.g., sales data, financial data, HR data, sensor data, and data about customers (identification data, demographics, loyalty data, etc.).
  • Third-party data is acquired from external sources to complement the organization’s own data, e.g., credit data, weather data, market survey results, social media data, and economic data from government agencies.

The data collected from the different sources during this phase is collated into a single, simple, flat tabular data structure made up of rows and columns, known as a dataset.
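
As a rough illustration of this collation step, the sketch below combines two internal sources and one third-party source into one flat dataset with one row per customer. All source names, field names and values are hypothetical, chosen only for illustration. In practice a library such as pandas is commonly used for this, but plain Python is enough to show the idea.

```python
from collections import defaultdict

# Hypothetical internal source: one record per customer (e.g., a CRM export).
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
    {"customer_id": 3, "age": 27},
]

# Hypothetical internal source: one record per sale (e.g., from an ERP database).
sales = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 45.5},
]

# Hypothetical third-party source: credit score keyed by customer ID.
credit_scores = {1: 710, 2: 640, 3: 580}

# Sales has many rows per customer, so aggregate it to one value per customer first.
total_spend = defaultdict(float)
n_purchases = defaultdict(int)
for sale in sales:
    total_spend[sale["customer_id"]] += sale["amount"]
    n_purchases[sale["customer_id"]] += 1

# Collate everything into a flat table: one row per customer, one column per field.
dataset = [
    {
        "customer_id": c["customer_id"],
        "age": c["age"],
        # Customers with no sales get zero activity instead of a missing value.
        "total_spend": total_spend.get(c["customer_id"], 0.0),
        "n_purchases": n_purchases.get(c["customer_id"], 0),
        # None marks a customer for whom no third-party score is available.
        "credit_score": credit_scores.get(c["customer_id"]),
    }
    for c in customers
]

for row in dataset:
    print(row)
```

Every row has the same columns, which is what makes the structure flat and tabular. How to handle entries missing from a source (zero, a placeholder, or dropping the row) is a design choice that depends on the project.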

