A machine learning model is a mathematical representation of the values contained in a dataset and their relationships to each other. Often stored as a simple file, it is the output of a “training” process in which a machine learning algorithm is optimized to extract certain patterns from a dataset. In supervised learning, the model is used to make predictions on new data, while in unsupervised learning it captures the natural groupings (clusters, associations, etc.) in the data.
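It is impossible to know in advance which machine learning algorithm will perform best on a given problem, so the practical approach is to try several candidate algorithms and compare them empirically. This follows from the “No Free Lunch Theorem”, which states that no single algorithm outperforms all others when averaged over all possible problems.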
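As a rough illustration of comparing algorithms empirically, the sketch below scores three toy candidates with k-fold cross-validation and picks the one with the lowest error. Everything in it is an assumption made for the example: the synthetic dataset, the helper names (`make_data`, `fit_mean`, `fit_linear`, `fit_nearest`, `cross_val_mse`), and the choice of mean squared error as the metric. A real project would normally use a machine learning library and its own data instead.

```python
import random
from statistics import mean


def make_data(n=60, seed=0):
    """Synthetic 1-D regression data: y = 2x + 1 plus Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
    return xs, ys


def fit_mean(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Candidate: ordinary least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: slope * (x - mx) + my


def fit_nearest(xs, ys):
    """Candidate: 1-nearest-neighbour, predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cross_val_mse(fit, xs, ys, k=5):
    """Average mean squared error over k folds; every k-th point forms one test fold."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        model = fit(train_x, train_y)
        fold_scores.append(mean([(model(xs[i]) - ys[i]) ** 2 for i in test_idx]))
    return mean(fold_scores)


# Every candidate is evaluated with the same validation procedure on the same data.
candidates = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}
xs, ys = make_data()
scores = {name: cross_val_mse(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, mse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:8s} cross-validated MSE = {mse:.3f}")
print("selected:", best)
```

Because the synthetic data is linear by construction, one would expect the least-squares candidate to score well here. On a different dataset the ranking could easily change, which is why the comparison has to be run rather than assumed.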
Goals
- Train the best-performing model possible within the constraints of the project, e.g., available data, IT resources, time, etc.
- Build a reusable software artifact that produces reliable results in the future
Some algorithms impose requirements on the form of the data (for example, numeric-only or scaled features), which often makes it necessary to go back to the data preparation phase.
Tasks
- Decide on the type of machine learning problem, i.e., supervised learning, unsupervised learning or reinforcement learning
- Devise a set of modeling experiments, including the model validation procedure
- Train and evaluate models
- Compare competing models
- Select, tune and debug the most suitable model
- Save deployable artifacts, e.g., data preparation pipeline, models, etc.
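The last task above, saving deployable artifacts, can be sketched as follows. This is a minimal example under stated assumptions, not a prescribed format: the artifact layout (a JSON file holding the scaler parameters and the model coefficients together) and the helper names (`fit_pipeline`, `predict`, `save_artifact`, `load_artifact`) are hypothetical, and the five data points are made up. The idea it illustrates is that the data preparation step and the model are stored as one unit, so that future predictions apply exactly the same preparation as training did.

```python
import json
import os
import tempfile
from statistics import mean, pstdev


def fit_pipeline(xs, ys):
    """Fit a standardization step and a least-squares line; return both as a plain dict."""
    mu, sigma = mean(xs), pstdev(xs)
    zs = [(x - mu) / sigma for x in xs]  # standardized feature (mean 0, unit variance)
    my = mean(ys)
    # Because zs has mean 0, the least-squares intercept equals the mean of ys.
    slope = sum(z * (y - my) for z, y in zip(zs, ys)) / sum(z * z for z in zs)
    return {
        "scaler": {"mean": mu, "std": sigma},
        "model": {"slope": slope, "intercept": my},
    }


def predict(artifact, x):
    """Apply the stored preparation step, then the stored model."""
    z = (x - artifact["scaler"]["mean"]) / artifact["scaler"]["std"]
    return artifact["model"]["slope"] * z + artifact["model"]["intercept"]


def save_artifact(artifact, path):
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(artifact, fh)


def load_artifact(path):
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Made-up training data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

artifact = fit_pipeline(xs, ys)
path = os.path.join(tempfile.mkdtemp(), "pipeline.json")
save_artifact(artifact, path)

# Later, possibly in another process: reload and predict without refitting.
restored = load_artifact(path)
print(predict(restored, 6.0))
```

Keeping the scaler parameters next to the model coefficients avoids a common failure mode in which a model is deployed with preparation logic that differs from what was used during training.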
Roles
- Data Scientist
- Data Engineer
- Business User
- Domain Expert
- Project Sponsor