Statistics & Methodology
Lecture I
Statistical reasoning
The foundation of all good statistical analysis is a deliberate, careful and thorough
consideration of uncertainty. The purpose of statistics is to systematize the way we account
for uncertainty when making data-based decisions.
Data scientists must scrutinize large amounts of data and extract useful knowledge from it. To convert this data into knowledge, data scientists apply various data-analytic techniques.
Statistical inference
When doing statistical inference, we focus on how certain variables relate to each other, for
example: does increased spending on advertising correlate with more sales? Is there a relation
between the number of liquor stores in an area and the amount of crime?
Statistical inference is the process of drawing conclusions about populations or scientific
truths from data.
Types of variables
Categorical
- Nominal: unordered categories, such as gender and marital status.
- Ordinal: ordered categories, such as level of education.
Numerical
- Discrete: a variable that can only take on a certain number of values; there are no in-between values, such as the number of cars parked in a lot or coin tosses.
- Continuous: a variable that can have an infinite number of values, such as time and weight.
Probability distributions
Probability distributions quantify how likely it is to observe each possible value of some
probabilistic entity, i.e. a list of all possible values of a random variable, along with their
probabilities.
Binomial distribution
The binomial distribution is a discrete probability distribution, as it is used for discrete variables. A random
variable has a binomial distribution if the following conditions are met:
- There are a fixed number of trials (n);
- Each trial has two possible outcomes (success or failure);
- The probability of success (p) is the same for each trial;
- The trials are independent.
There are several ways to find the binomial distribution of a random variable. First, you can
use the formula P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), where k is the number of successes.
However, you could also use so-called binomial tables.
Two of the most interesting properties of a distribution are the expected value and the
variance. The expected value is the mean of the distribution: the average value of all possible
values of a random variable. The variance of a distribution is the average squared distance of
each value from the expected value.
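As a minimal sketch of these ideas in Python (the trial count and success probability below are illustrative, not from the lecture):

```python
from math import comb

n, p = 10, 0.5  # number of trials and probability of success (illustrative values)

# Binomial probability of exactly k successes: C(n, k) * p^k * (1 - p)^(n - k)
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binom_pmf(4, n, p))  # P(X = 4)

# Expected value and variance of a binomial random variable
expected_value = n * p        # the mean of the distribution
variance = n * p * (1 - p)    # average squared distance from the mean
print(expected_value, variance)
```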
Normal distribution
The normal distribution is a continuous probability distribution. A variable has a normal distribution if its
values fall along a smooth, continuous curve: the bell curve.
Each normal distribution has its own mean (µ) and standard deviation (σ).
Note: if n is large enough, the normal distribution can be used as an approximation as well.
One special form of the normal distribution is the standard normal distribution, or Z-distribution. This has a mean of zero and a standard deviation of 1. A value on the Z-distribution represents the number of standard deviations the data is above or below the mean; this value is called a z-score.
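A short sketch of a z-score in Python (the mean, standard deviation, and observed value are illustrative):

```python
from scipy.stats import norm

mu, sigma = 100, 15  # population mean and standard deviation (illustrative)
x = 130              # an observed value

# z-score: number of standard deviations x lies above or below the mean
z = (x - mu) / sigma
print(z)             # 2.0

# Probability of observing a value below x on this normal distribution
print(norm.cdf(z))   # area under the standard normal curve up to z
```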
Sampling distributions
A sampling distribution quantifies the possible values of a test statistic under infinitely repeated
sampling.
In general, the mean of the sampling distribution equals the mean of the entire population.
This makes sense; the average of the averages from all samples is the average of the
population that the samples came from. Variability in the sampling distribution is measured in
terms of the standard error and is calculated with the following formula: SE = σx / √n,
where σx is the population standard deviation and n the sample size. As n is the denominator
in this formula, the standard error decreases as n increases. This means that larger samples
give more precision and less variation from sample to sample. And as σx is the numerator in this
formula, the standard error will increase if the population standard deviation
increases. This makes sense; it is harder to estimate the population’s average when the
population varies a lot to begin with.
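The effect of the sample size on the standard error can be checked with a short simulation (a sketch; the population mean, standard deviation, and sample size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma_x, n = 10.0, 25  # population standard deviation and sample size (illustrative)

# Theoretical standard error of the mean: sigma_x / sqrt(n)
print(sigma_x / np.sqrt(n))  # 2.0

# Empirical check: draw many samples and look at the spread of their means
sample_means = rng.normal(loc=50, scale=sigma_x, size=(10_000, n)).mean(axis=1)
print(sample_means.std())    # close to the theoretical standard error
```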
Interpretation
In a loose sense, each point on the curve says: there is a … probability of observing the
corresponding value in any given sample.
Statistical testing
In practice, we may want to distill the information from the probability distributions into a
simple statistic so we can make a judgement. One way to distill this information and control
for uncertainty is through statistical testing.
In parametric testing there are two common tests: the t-test and the z-test. The t-test is a
statistical test used to compare population means for two independent samples. The t-test is
most appropriate when the sample size is small (< 30) and when the population standard
deviation is unknown. The z-test is most appropriate when the population variance is known
and the sample size is large (> 30).
T-test vs. Z-test
- Meaning: both are parametric tests to identify how the means of two sets of data differ from one another; the t-test applies when the variance is unknown, the z-test when the variance is known.
- Distribution: Student's t-distribution (t-test) versus the normal distribution (z-test).
- Population variance: unknown (t-test) versus known (z-test).
- Sample size: small, < 30 (t-test) versus large, > 30 (z-test).
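A hedged sketch of both tests in Python: scipy provides an independent-samples t-test, while the z-test is computed by hand here since it assumes known population variances. All data and parameters below are invented for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind, norm

rng = np.random.default_rng(0)
a = rng.normal(5.0, 2.0, size=20)  # two small, independent samples (illustrative)
b = rng.normal(5.5, 2.0, size=20)

# t-test: population variance unknown, estimated from the samples
t_stat, p_val = ttest_ind(a, b)
print(t_stat, p_val)

# z-test: population variances assumed known (here: 4.0 for both groups)
var_a, var_b = 4.0, 4.0
z = (a.mean() - b.mean()) / np.sqrt(var_a / len(a) + var_b / len(b))
print(z, 2 * norm.sf(abs(z)))  # z-statistic and two-tailed p-value
```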
P-value
A test statistic by itself is just a number, therefore we need to compare the statistic to some
objective reference. This is done by computing a sampling distribution of the test statistic.
To quantify how exceptional our estimated test statistic is, we compare the estimated value
to a sampling distribution of the test statistic, assuming no effect (the null hypothesis). If our
estimated statistic would be very unusual in a population where the null hypothesis is true,
we reject the null and claim a statistically significant effect.
We can find the probability associated with a range of values by computing the area of the
corresponding slice from the distribution.
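For example, on the standard normal distribution the probability of a range of values is the area under the curve between the two cut-offs (a sketch using scipy; the interval is arbitrary):

```python
from scipy.stats import norm

# Probability that a standard normal value falls between -1 and 1:
# the area under the curve between those two cut-offs
print(norm.cdf(1) - norm.cdf(-1))  # about 0.68

# Area in the upper tail beyond an observed test statistic of 1.96
print(norm.sf(1.96))               # about 0.025
```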
By calculating the area in the null distribution that exceeds our estimated test statistic, we can
compute the probability of observing the given test statistic, or one even more extreme, if the null hypothesis were true.
In other words, we can compute the probability of having sampled the data we observed from
a population wherein there is no true mean difference in rating.
If your test statistic is close to 0, or at least within the range where most of the results should
fall, then you cannot reject the null hypothesis. If your test statistic is out in the tails of the
distribution, the results of this sample do not support the claim of the null hypothesis, so we reject it.
You can be more specific about your conclusion by noting exactly how far out of the
distribution the test statistic falls. You do this by looking up the test statistic in the distribution
and finding the probability of being at that value or beyond it. This is called the p-value and it
tells you how likely it was that you would have gotten your sample results if the null hypothesis
were true. The farther out the test statistic is on the tails of the distribution, the smaller the
p-value will be, and the more evidence you have against the null hypothesis.
To make a proper decision about whether or not to reject the null hypothesis, you determine
your cutoff probability for your p-value before doing a hypothesis test. This cutoff is called the
alpha level (α). If the p-value is greater than or equal to α, you cannot reject the null
hypothesis. If the p-value is smaller than α, you reject the null hypothesis.
Note: if you do not use a directional hypothesis, your test should be two-tailed.
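As a sketch of this decision rule, assuming a z-statistic and a two-tailed test (the statistic and α below are illustrative):

```python
from scipy.stats import norm

z_stat = 2.1  # estimated test statistic (illustrative)
alpha = 0.05  # cutoff probability chosen before running the test

# Two-tailed p-value: probability of a statistic at least this far from 0, in either tail
p_value = 2 * norm.sf(abs(z_stat))
print(p_value)

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Cannot reject the null hypothesis")
```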
Interpretation
There is a … probability of observing a test statistic at least as large as the estimated test statistic, if the null hypothesis is true. The p-value has the same logic as proof by contradiction.
Note: we cannot say that there is a … probability of observing the estimated test statistic itself, if the null hypothesis is true. This is because the probability of observing any individual point on a continuous distribution is exactly zero.