Data comes in several types, which are often grouped into two broad classes: continuous and categorical.
Continuous
Continuous data can take any real numerical value on a continuous scale. It comprises the numeric and interval types.
- Numeric: a whole number or fraction, positive or negative, that supports full arithmetic operations (e.g., weight, temperature, price)
- Interval: a number that supports ordering, addition, and subtraction only (e.g., salary scale, date, time)
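As a rough illustration of the interval type, the sketch below uses Python's standard `datetime.date`. The specific dates are made up for the example. Dates can be ordered and subtracted, and an offset can be added to them, but scaling a date by a factor is rejected.

```python
from datetime import date

# Two illustrative dates (values chosen only for this example).
start = date(2024, 1, 1)
end = date(2024, 3, 1)

# Ordering and subtraction are meaningful for interval data.
is_later = end > start
gap = end - start          # a timedelta, i.e. a difference between dates
gap_days = gap.days

# Adding a difference back onto a date is also meaningful.
shifted = end + gap

# Multiplying a date by a number has no meaning; Python raises TypeError.
try:
    end * 2
    scaling_allowed = True
except TypeError:
    scaling_allowed = False

print(is_later, gap_days, shifted, scaling_allowed)
```

Python's type system happens to enforce the restriction here. Many tools store dates as plain numbers, in which case it is up to the analyst to avoid operations that are not meaningful.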
Categorical
Categorical data takes a finite set of values and does not allow arithmetic operations. The common types of data in this category are listed below.
- Nominal: only takes a limited set of values with no meaningful intrinsic ordering (e.g., color, country, profession)
- Ordinal: only takes a limited set of values with a meaningful intrinsic ordering (e.g., size coded as small, medium, or large; product rating; fault severity coded as minor, major, or critical)
- Binary: only takes two values (e.g., gender; color coded as bright or dull; time of day coded as day or night)
- Text: free-form text data (e.g., name, address, emails, documents, reviews, social media posts)
- Image: stored as pixels containing information about color and intensity
- Geospatial: contains location information
- Network: describes how items are connected, including the strength and direction of each relationship
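The difference between nominal and ordinal data can be sketched in a few lines of standard-library Python. The sample values and the names `SEVERITY_LEVELS` and `rank` are assumptions made for this example. Ordinal values can be sorted and compared through their level order, while nominal values only support equality checks and counting.

```python
from collections import Counter

# Ordinal: the order of the levels carries meaning.
SEVERITY_LEVELS = ["minor", "major", "critical"]   # lowest to highest
rank = {level: i for i, level in enumerate(SEVERITY_LEVELS)}

faults = ["major", "critical", "minor", "major"]   # made-up sample
sorted_faults = sorted(faults, key=rank.__getitem__)
worst_fault = max(faults, key=rank.__getitem__)

# Nominal: no ordering, so the sensible summaries are counts and the mode.
colors = ["red", "blue", "red", "green"]           # made-up sample
color_counts = Counter(colors)
most_common_color = color_counts.most_common(1)[0][0]

print(sorted_faults, worst_fault, most_common_color)
```

Sorting the colors alphabetically would still run, but the resulting order would say nothing about the data itself.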
Why Is Understanding Data Types Important?
Assigning the correct type to a feature is critical because it determines how the data will be stored and how it will be used by machine learning algorithms. Consider a variable “profession” coded as 1, 2, 3, and so on, where 1 = nurse, 2 = engineer, 3 = plumber. If it were incorrectly specified as a numeric variable, machine learning algorithms would calculate its sum, average, standard deviation, and so on. These results are meaningless because the codes are arbitrary labels, not quantities.
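The profession example can be sketched in standard-library Python. The list of codes is invented for illustration, and the names `PROFESSION_LABELS` and `codes` are assumptions. Treating the codes as numbers produces an average that corresponds to no profession, whereas treating them as categories gives interpretable counts.

```python
from collections import Counter
from statistics import mean

# Coding scheme from the text: 1 = nurse, 2 = engineer, 3 = plumber.
PROFESSION_LABELS = {1: "nurse", 2: "engineer", 3: "plumber"}

# Made-up sample of coded observations.
codes = [1, 3, 3, 2, 1, 3]

# Treated as numeric: the computation runs, but the result is not a profession.
misleading_mean = mean(codes)

# Treated as categorical: map codes to labels and count them.
profession_counts = Counter(PROFESSION_LABELS[c] for c in codes)
most_common_profession = profession_counts.most_common(1)[0][0]

print(round(misleading_mean, 2), dict(profession_counts), most_common_profession)
```

The average would also change if the codes were reassigned (say, 1 = plumber), even though the underlying data would be identical. That is a quick way to see that the number carries no information.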