DATA SCIENCE: Variability. Mean, Median, Mode

What is a Variable?
A variable is an attribute that can be used to describe a person, place, or thing. In the case of statistics, it is any attribute that can be represented as a number. The numbers used to represent variables fall into two categories:

Quantitative variables are those for which the value has numerical meaning. The value refers to a specific amount of something. The higher the number, the more of some attribute the object has. For example, temperature, sales, and number of flyers posted are quantitative variables. Quantitative variables can be:
o Continuous: a value that is measured along a scale (e.g., temperature), or
o Discrete: a value that is counted in fixed units (e.g., the number of flyers distributed).

Categorical variables are those for which the value indicates group membership. Thus, you can’t say that one person, place, or thing has more or less of something based on the number assigned to it, because that number is an arbitrary label. In Rosie’s data, the location where the drinks are sold is a categorical variable. Gender is a classic example.
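
As a rough illustration of the difference, the sketch below uses a few made-up records in the spirit of Rosie’s data. The field names and values are assumptions, not taken from her actual data set. Arithmetic makes sense on the quantitative fields, while the categorical field can only be counted by group.

```python
from collections import Counter

# Hypothetical records; every value here is invented for illustration.
records = [
    {"location": "Park", "temperature": 27.5, "flyers": 90},
    {"location": "Beach", "temperature": 30.1, "flyers": 120},
    {"location": "Park", "temperature": 25.0, "flyers": 75},
]

# Quantitative, discrete: counted in fixed units, so totals are meaningful.
total_flyers = sum(r["flyers"] for r in records)

# Quantitative, continuous: measured along a scale, so averages are meaningful.
avg_temperature = sum(r["temperature"] for r in records) / len(records)

# Categorical: the value is a group label, so we count membership instead.
location_counts = Counter(r["location"] for r in records)

print(total_flyers, round(avg_temperature, 2), dict(location_counts))
```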

Population vs. Sample

A population includes all elements of interest.
A sample consists of a subset of observations from a population. Because a sample is only a subset, many different samples can be drawn from the same population.

A numerical summary (such as the mean) computed from a population is called a parameter; the same summary computed from a sample is called a statistic.
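
To make the parameter/statistic distinction concrete, here is a minimal sketch that simulates a population and draws a few samples from it. The population itself (10,000 values around 100) is an assumption chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Hypothetical population: 10,000 simulated daily sales figures.
population = [random.gauss(100, 15) for _ in range(10_000)]

# Parameter: computed from every element of the population.
population_mean = statistics.fmean(population)

# Statistics: each sample of 50 observations yields its own mean.
sample_means = [
    statistics.fmean(random.sample(population, 50)) for _ in range(3)
]

print(round(population_mean, 2), [round(m, 2) for m in sample_means])
```

The three sample means land near the population mean but are not identical to it or to each other, which is why a statistic is treated as an estimate of a parameter.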

Measures of Central Tendency

Central tendency measures simply provide information on the most typical values in the data for a given variable.

Mean: Mean represents the average value of the variable and is calculated by summing across all observations and dividing by the total number of observations.
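
A small sketch of that calculation, using invented temperature readings, with Python's standard `statistics` module as a cross-check:

```python
import statistics

temperatures = [22, 25, 19, 30, 24]  # hypothetical observations

# Sum across all observations, then divide by the number of observations.
mean_by_hand = sum(temperatures) / len(temperatures)  # 120 / 5 = 24.0

# The standard library computes the same value.
mean_stdlib = statistics.mean(temperatures)

print(mean_by_hand, mean_stdlib)
```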

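Median: The median is the middlemost value in the data for a given variable; 50% of the values are above it and 50% are below. To find the median, you must order your data from smallest to largest. The median then sits at position:

Position of Md = (n+1)/2

where n is the number of observations. When n is even, this position falls between two values, and the median is the average of those two.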
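
The (n+1)/2 rule gives a position in the ordered data, not the median itself. A minimal sketch, assuming a non-empty list of numbers (the helper name `median_by_position` is made up for this example):

```python
import statistics

def median_by_position(values):
    """Median of a non-empty list via the (n + 1) / 2 position rule."""
    ordered = sorted(values)          # order from smallest to largest
    n = len(ordered)
    position = (n + 1) / 2            # 1-based position of the median
    if position.is_integer():         # odd n: a single middle value
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]  # even n: the two values on either
    upper = ordered[int(position)]      # side of the fractional position
    return (lower + upper) / 2

odd_case = median_by_position([7, 1, 5, 3, 9])   # position 3 -> 5
even_case = median_by_position([7, 1, 5, 3])     # position 2.5 -> (3 + 5) / 2

print(odd_case, even_case, statistics.median([7, 1, 5, 3]))
```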

Mode: The mode is the most frequently occurring value for a given variable; if there is more than one mode, report them all. A histogram or frequency table makes the mode easy to spot.
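
As a sketch of reporting every mode, the example below uses `statistics.multimode` (available in Python 3.8 and later) on invented data, with a `Counter` frequency table standing in for a histogram:

```python
import statistics
from collections import Counter

sales = [3, 5, 5, 2, 3, 8]  # hypothetical; 3 and 5 both occur twice

# All most-frequent values, listed in order of first appearance.
modes = statistics.multimode(sales)

# Frequency table: a text stand-in for a histogram.
counts = Counter(sales)

print(modes, counts.most_common())
```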

Measures of Variability

Range: simply the minimum value subtracted from the maximum value:

Range = Max(x_i) – Min(x_i)
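
A quick sketch with invented numbers. Because the range depends only on the two extreme values, a single unusual observation can change it a lot:

```python
observations = [12, 7, 3, 18, 9]  # hypothetical data

data_range = max(observations) - min(observations)  # 18 - 3 = 15

# Adding one extreme value changes the range dramatically.
with_outlier = observations + [100]
range_with_outlier = max(with_outlier) - min(with_outlier)  # 100 - 3 = 97

print(data_range, range_with_outlier)
```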

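Variance measures the dispersion of the data around the mean. It is the average of the squared deviations of each observation from the mean:

s² = Σ(x_i – x̄)² / (n – 1)

Here x̄ is the sample mean and n is the number of observations; for a full population, divide by n instead of n – 1. We must square the deviations because if we didn’t, the positive and negative deviations would cancel and their sum would always be zero.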
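
The sketch below walks through the calculation on invented data: the raw deviations cancel to zero, and squaring them first gives a usable measure. It shows both the population form (divide by n) and the sample form (divide by n - 1), since the two are easy to mix up.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
n = len(data)
mean = sum(data) / n  # 40 / 8 = 5.0

deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)  # positives and negatives cancel: 0.0

squared = [d ** 2 for d in deviations]
population_variance = sum(squared) / n      # 32 / 8 = 4.0
sample_variance = sum(squared) / (n - 1)    # 32 / 7

# Cross-check against the standard library.
print(sum_of_deviations, population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
```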

Standard deviation is the square root of the variance.
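
In code this is one extra step after the variance. A minimal sketch on invented data; taking the square root puts the result back in the same units as the original observations:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Population form: square root of the population variance.
population_sd = math.sqrt(statistics.pvariance(data))  # sqrt(4) = 2.0

# Sample form: square root of the sample variance.
sample_sd = math.sqrt(statistics.variance(data))

# statistics.pstdev and statistics.stdev do the same in one call.
print(population_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```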

In normal distributions, about 68% of the data fall within +/-1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
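
These percentages can be checked by simulation. The sketch below draws values from a normal distribution (the seed and the sample size of 100,000 are arbitrary choices) and measures the share that falls within 1, 2, and 3 standard deviations of the mean:

```python
import random
import statistics

random.seed(42)  # fixed seed for repeatability
draws = [random.gauss(0, 1) for _ in range(100_000)]

mu = statistics.fmean(draws)
sigma = statistics.pstdev(draws)

def share_within(k):
    """Fraction of draws within k standard deviations of the mean."""
    inside = sum(1 for x in draws if abs(x - mu) <= k * sigma)
    return inside / len(draws)

shares = {k: share_within(k) for k in (1, 2, 3)}
print(shares)
```

With a sample this large, the three shares should come out close to 0.68, 0.95, and 0.997.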

However, the data can take on other shapes, including right (positive) skewed, where the tail of the distribution is on the right side of the curve (typically indicated by a median and mode that are less than the mean), or left (negative) skewed, where the tail is to the left (typically indicated by a median and mode that are greater than the mean).
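
A small simulation shows the mean being pulled toward the tail. The exponential distribution is used here only as a convenient example of right-skewed data; mirroring it gives a left-skewed counterpart.

```python
import random
import statistics

random.seed(1)  # fixed seed for repeatability

# Right-skewed: most values are small, with a long tail of large values.
right_skewed = [random.expovariate(1.0) for _ in range(10_000)]
mean_right = statistics.fmean(right_skewed)
median_right = statistics.median(right_skewed)

# Left-skewed: the mirror image, with the long tail on the left.
left_skewed = [-x for x in right_skewed]
mean_left = statistics.fmean(left_skewed)
median_left = statistics.median(left_skewed)

print(round(median_right, 3), round(mean_right, 3))  # median below mean
print(round(median_left, 3), round(mean_left, 3))    # median above mean
```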

Standard error: It indicates how close the sample mean is likely to be to the true population mean. The means obtained from samples are estimates of the population mean, and they vary from sample to sample when different samples are drawn from the same population.

SE = S / √n

where S is the sample standard deviation and n is the sample size.
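
A minimal sketch of the calculation on an invented sample of nine observations, where S is taken to be the sample standard deviation:

```python
import math
import statistics

sample = [98, 102, 95, 110, 105, 99, 101, 97, 103]  # hypothetical sample

s = statistics.stdev(sample)   # sample standard deviation
n = len(sample)                # 9 observations, so sqrt(n) = 3
standard_error = s / math.sqrt(n)

# With the same spread, a sample four times as large halves the standard error.
se_with_4n = s / math.sqrt(4 * n)

print(round(standard_error, 3), round(se_with_4n, 3))
```

Because n sits under a square root, precision improves slowly: cutting the standard error in half takes roughly four times as many observations.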

Kurtosis is commonly described as a measure of peakedness, although it is driven mainly by how heavy the tails of the distribution are. Is the distribution tall and narrow, or is it short and flat?

Skewness is a measure of the symmetry of the data. The skewness value indicates the direction of the tail. If it is positive, the distribution is right skewed; if negative, the distribution is left skewed. A normal distribution has a skew of 0.
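
Both measures can be computed from central moments. The sketch below uses one common moment-based definition (skewness as m3 / m2^1.5 and excess kurtosis as m4 / m2^2 - 3, which is 0 for a normal distribution); statistical packages often apply small-sample corrections, so their results may differ slightly. The function names are made up for this example.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness; the sign gives the direction of the tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical, perfectly symmetric
right_tail = [1, 1, 2, 2, 3, 10]     # hypothetical, long tail to the right
left_tail = [-x for x in right_tail]  # mirror image, tail to the left

print(skewness(symmetric), skewness(right_tail), skewness(left_tail))
print(excess_kurtosis(symmetric))
```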

Monday, 22 May 2017
