Toegepaste Statistiek 1 – Economie en
bedrijfseconomie
WEEK 2
Mean (x̄)
Calculated by the sum of the observations divided by the number of observations:
x̄ = (x₁ + x₂ + … + xₙ)/n = (1/n) Σ xᵢ
> n = number of observations
We use x-bar to denote the mean of all observations of x. Alternatively, x-bar is used to
denote the mean of a sample that is drawn from a population. The mean of the population
itself is denoted by μ.
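A minimal sketch of the sample mean, using a made-up data set:

```python
# Mean: sum of the observations divided by the number of observations n.
data = [4, 8, 6, 2]            # hypothetical observations
x_bar = sum(data) / len(data)  # (4 + 8 + 6 + 2) / 4
print(x_bar)                   # 5.0
```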
Median (M)
The value that splits the numerically ordered observations into two equal parts: the
middle value of all observations. If the data cannot be split into two equal parts by a
single middle value (even number of observations), the median is computed by taking the
mean of the two middle observations. The median is denoted by: M
!! TIP: Sort the numbers from small to large. You can then find the median by counting
the number of values and dividing by 2 to locate the middle position.
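The ordering rule from the tip can be sketched as a small helper, here with made-up data:

```python
def median(values):
    # Sort from small to large first, as the tip says.
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:            # odd n: take the single middle value
        return ordered[mid]
    # even n: take the mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 7]))      # 3
print(median([3, 1, 7, 5]))   # (3 + 5) / 2 = 4.0
```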
Mode
The most frequent value. If multiple values have the same frequency, the data has multiple
modes.
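A sketch of finding all modes (the data may be multimodal), using a made-up data set:

```python
from collections import Counter

def modes(values):
    # Return every value that shares the highest frequency.
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3, 4]))   # [2, 3] -> two modes
```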
Variance (s²)
The average of the squared deviations from the mean.
s² = (1/(n − 1)) Σ (xᵢ − x̄)²
!! Disadvantage: since we squared the deviations, the variance does not have the same unit
of measurement as the observations.
➔ Preferred: using standard deviation: by taking the square root of the variance,
the standard deviation does have the same unit of measurement as the
observations.
Standard deviation (s/σ)
The square root of the average of the squared deviations from the mean, i.e. the square
root of the variance:
s = √s² = √[ (1/(n − 1)) Σ (xᵢ − x̄)² ]
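A sketch of both formulas on a made-up data set; Python's `statistics` module also uses the sample (n − 1) version, so it serves as a check:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]                             # hypothetical observations
x_bar = sum(data) / len(data)                               # 5.0
# Sample variance: squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)
s = math.sqrt(s2)                                           # standard deviation
print(s2, statistics.variance(data))   # both 32/7, approx. 4.571
print(s, statistics.stdev(data))
```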
Percentiles
The values that divide a distribution into 100 equal parts. Thus, the p-th percentile of a
distribution is the value for which p% of the observations lie below (or equal to) it.
Quartiles
The values that divide a distribution into 4 equal parts. The first quartile, Q1, is the 25th
percentile of the data. The second quartile, Q2, is the median. The third quartile, Q3, is the
75th percentile of the data.
Interquartile range (IQR)
The actual spread of the distribution is described by the distance between the 1st and 3rd
quartile, which is called the IQR.
IQR = Q3 – Q1
!! NOTE: outliers influence the interquartile range less than the variance and the standard
deviation.
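Quartile conventions differ between textbooks and software; the sketch below uses a common "median of the halves" rule (excluding the overall median when n is odd) on made-up data:

```python
def quartiles(values):
    # Q1/Q3 as medians of the lower/upper half; other conventions exist.
    def med(v):
        n = len(v)
        m = n // 2
        return v[m] if n % 2 else (v[m - 1] + v[m]) / 2

    v = sorted(values)
    n = len(v)
    lower, upper = v[: n // 2], v[(n + 1) // 2 :]
    return med(lower), med(v), med(upper)

q1, q2, q3 = quartiles([1, 3, 5, 7, 9, 11, 13])
print(q1, q2, q3, q3 - q1)   # 3 7 11 8  (IQR = Q3 - Q1)
```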
Density curve
A smooth approximation to the bars of a histogram. It describes the overall pattern of the
distribution of our data. The curve is always on or above the horizontal axis. If we take a
range of values and calculate the area below the curve, we find the proportion of all
observations that fall in this range. Therefore, the total area below the density curve is
exactly 1.
Pattern name            Occurrence
Normal distribution     Height of randomly selected men
Uniform distribution    Outcomes of a die roll
68-95-99.7 rule
States that for any normal distribution, approximately 68% of the values lie within 1
standard deviation from the mean, 95% of the values lie within 2 standard deviations from
the mean and 99.7% of the values lie within 3 standard deviations from the mean.
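The rule can be checked numerically with the standard Normal distribution from Python's `statistics` module (the areas come out to about 0.683, 0.954 and 0.997):

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # the standard Normal distribution
for k, pct in [(1, 68), (2, 95), (3, 99.7)]:
    # Area within k standard deviations of the mean.
    within = Z.cdf(k) - Z.cdf(-k)
    print(f"within {k} sd: {within:.3f} (~{pct}%)")
```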
The standard Normal distribution (Z-score)
Is the normal distribution with mean 0 and standard deviation 1: N(0,1).
If variable ‘X’ follows a Normal distribution N(μ ,σ), we can reduce this distribution to the
standard Normal distribution (i.e. standardize the variable). The standardized variable can be
calculated by:
Z = (x − μ)/σ
Standardizing the distribution: if all values in a normal distribution are transformed to Z-
scores, then the distribution will have a mean of 0 and a standard deviation of 1.
!! NOTE: we can use a table (like table A in the formula sheet) to determine the area under
the curve corresponding to a z-score.
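A sketch of standardizing one value and looking up its area, with hypothetical mean and standard deviation (`NormalDist().cdf` plays the role of table A):

```python
from statistics import NormalDist

mu, sigma = 170, 10          # hypothetical mean/sd of heights in cm
x = 185
z = (x - mu) / sigma         # Z = (x - mu) / sigma = 1.5
area = NormalDist().cdf(z)   # proportion of values below x
print(z, round(area, 4))     # 1.5 0.9332, matching the table value for z = 1.50
```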
Covariance (cov)
Sometimes you’re interested in the association between the spread of one variable and the
spread of another variable (e.g. height and weight).
The covariance between variables X and Y is a measure of linear association between them:
cov(X,Y) = (1/(n − 1)) Σ (xᵢ − x̄)(yᵢ − ȳ)
If the covariance is positive, there is a positive association between the variables.
Conversely, if the covariance is negative, there is a negative relation between the
variables. However, if we want to say something about the strength of the association,
the covariance is hard to interpret: depending on the scale of our variables, the
covariance changes.
!! Disadvantage: its dependence on the scale of the variables: the covariance between
height and weight measured in m and kg respectively, will be different from the covariance
between height and weight if measured in inches and kg, even if it’s the same sample.
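A sketch of the formula on made-up height/weight data, which also shows the scale problem (rescaling height from m to cm multiplies the covariance by 100):

```python
heights = [1.60, 1.70, 1.80]   # hypothetical heights in m
weights = [55.0, 65.0, 81.0]   # hypothetical weights in kg

n = len(heights)
mx = sum(heights) / n
my = sum(weights) / n
# Sample covariance: average product of deviations, divided by n - 1.
cov = sum((x - mx) * (y - my) for x, y in zip(heights, weights)) / (n - 1)
print(round(cov, 3))           # 1.3: positive -> taller people tend to be heavier

cov_cm = cov * 100             # same data, but with height measured in cm
print(round(cov_cm, 1))        # 130.0: same sample, very different number
```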
Correlation (r)
Measures the direction and strength of a linear relation. It is the standardized version of the
covariance: its value is not dependent on the measurement scale of variables.
r = cov(X,Y) / (s_x · s_y)
> s_x, s_y = standard deviations of X and Y
Whereas the covariance can take values between -∞ and +∞, the correlation can take values
between -1 and +1. Values near -1 or +1 indicate a strong negative or positive relation,
respectively. Values near 0 indicate a weak relation.
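A sketch of the standardization on made-up data; a perfectly linear relation gives r = 1 (up to floating-point rounding):

```python
import math

def corr(xs, ys):
    # r = cov(X, Y) / (s_x * s_y), always between -1 and +1.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)

print(corr([1, 2, 3, 4], [2, 4, 6, 8]))   # y = 2x, exactly linear -> r = 1
```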
Least-squares regression
The equation of the least-squares regression line of dependent (response) variable y on
independent (explanatory) variable x is given by
ŷ = b0 + b1x
with slope b1 = r · (s_y / s_x) and intercept b0 = ȳ − b1 x̄, so that we minimize the sum of
the squares of the vertical distances of the data points from the line.
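The slope b1 = r · (s_y / s_x) can equivalently be written as Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², which the sketch below uses on a made-up data set:

```python
xs = [1, 2, 3, 4, 5]    # hypothetical explanatory variable
ys = [2, 4, 5, 4, 5]    # hypothetical response variable

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope: equivalent to r * (s_y / s_x).
b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
# Intercept: b0 = y-bar - b1 * x-bar, so the line passes through (x-bar, y-bar).
b0 = my - b1 * mx
print(round(b0, 10), b1)   # intercept 2.2, slope 0.6
```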