Basic Statistics

3 min read

Table of Contents

Central Tendency
Dispersion/Variability
Distributions
Correlation

This post presents key concepts in statistics that describe or summarize the characteristics of a sample or a dataset, thus enabling successful exploratory data analysis: central tendency, dispersion/variability, distribution and correlation.

Central Tendency #

Central tendency identifies the central position within a dataset. It is commonly measured by the arithmetic mean, the median, and the mode.

Consider a sample containing prices of dresses: DressPrice[28,20,34,19,24,27,28,19,31]
Arithmetic mean (or simply mean) is the sum of values in a sample divided by number of items in the sample = 230/9 = 25.56
Mode is the value that appears most often in a sample; a sample with more than one mode is called multimodal. This sample has two modes – 28 and 19 – and is referred to as bimodal.
Median is the value in the middle of a data sample, population or distribution when the data is arranged in increasing order. If the sample has an odd number of data points, then the median is the middle data value when the data is arranged in increasing order. If the sample has an even number of data points, then the median is the mean of the two middle data values when the data is arranged in increasing order.
The new look dataset after rearranging the values is DressPrice[19,19,20,24,27,28,28,31,34] and median = 27.

Dispersion/Variability #

Range is the numerical difference between the largest and the smallest values
First quartile (Q1) is a measure of dispersion that indicates where the lower 25% of the values are located in a sample. It is calculated as the median of the lower half of the sample. Lower half of the DressPrice sample is [19,19,20,24], its median and therefore Q1=(19+20)/2=19.5.
Third quartile (Q3) is a measure of dispersion that indicates where the upper 25% of the values are located in a sample. It is calculated as the median of the upper half of the sample. The upper half of the DressPrice sample is [28,28,31,34], its median and therefore Q3=(28+31)/2=29.5.
Interquartile range (IQR) is a measure of dispersion that indicates where the middle 50% of the values are located. It is calculated as IQR=Q3-Q1, in this case IQR=29.5-19.5=10.
IQR is a very important statistic because it does not include extreme values (outliers). Consequently, it is an adequate measure for eliminating outliers in a dataset
Variance measures how far data points are spread out from their average value; the closer the variance to zero, the more closely the data points are clustered together. Its official formula is:

Standard deviation quantifies the amount of variation or dispersion of a set of data values; the lower the value, the closer the data points are to the mean on the dataset. Figure 1 below illustrates the difference between low- and high standard deviation.

Figure 1. Low – vs High Standard Deviation
By JRBrown – Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=10777712
Accessed on 10 May 2022

The formula for standard deviation is:

Variance is identical to the square of standard deviation and hence expresses “the same thing” (but more strongly).

Distributions #

Frequency distribution: table or graph displaying the frequency of various outcomes in a sample. It summaries the distribution of values in the sample. See example in Figure 2.

Figure 2. Frequency Distribution of Dress Sales

Probability distribution: a graph or table that depicts the likelihood of all potential outcomes in a sample. It provides the probabilities of occurrence of different possible outcomes in an experiment. Figure 3 shows probability distribution of data from Figure 2.

Figure 3. Example Probability Distribution

Sampling distribution: probability distribution based on a random sample of a dataset

Normal (or Gaussian) distribution represents data that occurs where most values are the same as the mean and only a few are found at the extremities. Approximately 99.7% of the values are within three standard deviations of the mean, while 95.4% are within two standard deviations of the mean and 68.3% are within one standard deviation of the mean. Normal distribution has the same mean, median, and mode. The area under the normal distribution curve is equal to one. See Figure 4 below.

Figure 4. Shape and Properties of the Normal Distribution
D Wells, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons
Accessed on 10 May 2022

Skewness measures how data is distributed around the mean; it can indicate potential bias in a dataset. See Figure 5

Negative skew: the majority of the values exist on the right side of the curve, and the mean is less than the median and mode
Positive skew: the majority of the values exist on the left side of the curve, and the mean is greater than the median and mode

Correlation #

Pearson correlation coefficient a measure of the linear correlation between two variables. +1 indicates the values of the variables move in the same direction in perfect unison, -1 indicates variation in perfectly opposite directions, and zero indicates no correlation at all. See Figure 6 below.

IMPORTANT: Correlation does not imply causation, i.e., two events happening together does not necessarily mean there is a cause-and-effect relationship.

Correlation does not imply causation
Source: https://twitter.com/VirginiaBuysse/status/1420350036144177152
Accessed on 10 May 2022

Join WhatsApp group here
Join Facebook group here

What are your Feelings

Still stuck? How can we help?

Updated on 3 June, 2022

1. Introduction to Artificial Intelligence and Machine Learning

2. Machine Learning Success Factors

3. Build and Use a Quick Machine Learning Model

4. Defining the Business Problem

5. Data Understanding

6. Data Preparation

7. Modeling

8. Predictive Modeling

9. Model Validation

10. Model Deployment

Basic Statistics

Central Tendency #

Dispersion/Variability #

Distributions #

Correlation #

What are your Feelings

Leave a Reply Cancel reply

How can we help you?

Central Tendency #

Dispersion/Variability #

Distributions #

Correlation #

What are your Feelings

Share post:

What's on your mind?

Leave a Reply Cancel reply