Summary statistics and methodology
Probability distributions
A probability distribution is basically just a mathematical function that describes all the possible values of any
probabilistic entity (a random variable).
- Probability distributions quantify how likely it is to observe each possible value of some probabilistic
entity (for example, height)
- Probability distributions are re-scaled frequency distributions.
We could measure the heights of all women in the Netherlands, put them into a histogram,
and then look at how frequently a height of 1.70 m occurs.
With an infinite number of bins, a histogram smooths into a continuous curve.
In a loose sense, each point on the curve gives the probability of observing the corresponding X
value in any given sample.
- It says "in a loose sense" because only an area under the curve represents a probability; the
probability of any single exact value is zero.
The one special characteristic of a probability distribution, compared to an ordinary histogram or frequency
distribution, is that the area under the curve must integrate to 1.0.
- The reason is that the entire curve covers every possible outcome, so the total probability under it must be 1.
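A quick sanity check of this property in R (using the standard normal density as an example; any proper density would do):

    # The area under a probability density integrates to 1
    integrate(dnorm, lower = -Inf, upper = Inf)   # returns 1, up to numerical error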
Suppose I tell you that the mean lap time for setup A = 118 seconds
The mean lap time for setup B = 110 seconds
First scenario:
- The standard deviation for the times under setup A = 7 seconds and the standard deviation for the
times under setup B = 5 seconds
Second scenario:
- The standard deviation of times under setup A = 35 seconds and the standard deviation under setup B
= 25 seconds
For the first scenario you can hopefully conclude that you can be much more confident recommending setup B
(with a std. of 5) because here the average lap times are measured with much greater precision.
In the second scenario we might not be able to differentiate between the two setups: the means may differ,
but the individual times overlap quite a bit. Therefore, we have to consider not only the location (the mean) of
the two distributions of lap times, but also their variability.
We will gain insight by conceptualizing our example problem in terms of the underlying distributions of lap
times.
In the plots, the left panel shows the first scenario, with a standard deviation of 5 for setup B in blue and a
standard deviation of 7 for setup A in red; the right panel shows the second scenario. This gives a clear image
of what is meant by the scores overlapping a lot in the second scenario.
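A minimal R sketch of these two plots, using the means and standard deviations from the scenarios above (colours and axis ranges are just illustrative choices):

    # Scenario 1: setup A ~ N(118, 7) in red, setup B ~ N(110, 5) in blue
    curve(dnorm(x, mean = 118, sd = 7), from = 85, to = 145, col = "red",
          xlab = "Lap time (s)", ylab = "Density")
    curve(dnorm(x, mean = 110, sd = 5), add = TRUE, col = "blue")

    # Scenario 2: setup A ~ N(118, 35) in red, setup B ~ N(110, 25) in blue
    curve(dnorm(x, mean = 118, sd = 35), from = 0, to = 240, col = "red",
          xlab = "Lap time (s)", ylab = "Density")
    curve(dnorm(x, mean = 110, sd = 25), add = TRUE, col = "blue")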
Statistical testing
The probability distributions give us a good idea of how homogeneous/heterogeneous populations can be. But
it is difficult to make judgements from these plots alone. We want an objective statistic that also takes the
variability into account, and this is where statistical testing comes in.
- A common statistical test is the t-test / Wald test.
T-statistic in R:
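A minimal sketch with simulated lap times (the data are made up for illustration; only the means and standard deviations echo the first scenario above):

    set.seed(1)
    times_a <- rnorm(30, mean = 118, sd = 7)   # hypothetical laps under setup A
    times_b <- rnorm(30, mean = 110, sd = 5)   # hypothetical laps under setup B

    # Welch two-sample t-test: t = (difference in means) / (standard error of that difference)
    out <- t.test(times_a, times_b)
    out$statistic   # the estimated t-statistic
    out$p.value     # its p-value (interpreted later in these notes)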
A test statistic, by itself, is just an arbitrary number.
- To conduct the test, we need to compare the test statistic to some objective reference
- This objective reference needs to tell us something about how exceptional our test statistic is.
The specific reference that we’re going to use is a so-called sampling distribution of our test statistic.
Sampling distribution
A sampling distribution is simply the probability distribution of a statistic.
- The sampling distribution quantifies the possible values of the test statistic over infinite repeated
sampling.
So, what we want to do is think about drawing a sample from the population of lap times, calculating the
mean difference and the t-statistic; that would be one point that goes into making up this curve, the sampling
distribution. We do this again and get another point, and so on.
The area of a region under the curve represents the probability of observing a test statistic within the
corresponding interval.
- If the estimated value lands in the tails, it is an improbable value under that distribution.
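A hedged sketch of this idea in R: repeatedly draw two samples from one and the same population (so there is no true difference), compute the t-statistic each time, and look at the distribution of those statistics:

    set.seed(1)
    t_stats <- replicate(10000, {
      a <- rnorm(30, mean = 114, sd = 6)   # both samples come from the same population,
      b <- rnorm(30, mean = 114, sd = 6)   # so any difference is pure sampling noise
      t.test(a, b)$statistic
    })
    hist(t_stats, breaks = 50,
         main = "Sampling distribution of t under no effect")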
Sampling distribution quantifies the possible values of a statistic (mean, t-statistic, correlation, etc.)
Distribution of a random variable quantifies the possible values of a variable (age, gender, income, etc.)
To quantify how exceptional our estimated t-statistic is, we compare the estimated value to a sampling
distribution of t-statistics assuming no effect.
- This distribution quantifies the null hypothesis
The special case of a null hypothesis of no effect is called the nil-null
Interpreting P-values
What the P-value does tell us: the probability of observing a t-statistic larger than or equal to our estimated
test statistic, given that the null hypothesis is true.
- All that we can say is that there is a 0.032 probability of observing a test statistic at least as large as ^t
(the estimated test statistic), if the null hypothesis is true.
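In R this tail probability can be read off directly from the reference t-distribution. A minimal sketch (the estimated t-statistic and the degrees of freedom are assumed values for illustration):

    t_hat <- 2.2   # hypothetical estimated t-statistic
    df    <- 58    # hypothetical degrees of freedom
    2 * pt(abs(t_hat), df = df, lower.tail = FALSE)   # two-sided p-value, roughly 0.032 with these numbers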
Statistical modeling
Data scientists rarely have the luxury of being able to conduct experiments and thus to control for confounding
factors. When working with observational data we usually don't have randomly assigned groups, and this
makes the groups potentially incomparable. So, statistical testing as a stand-alone tool is only useful in
experimental contexts. And since we also need to be able to control for confounding variables in observational
data, we need statistical modeling.
Modelers attempt to build a mathematical representation of the (interesting aspects of a) data distribution.
Beginning with a model ensures that we are learning the important features of that distribution. We describe it
in terms of variables, put them into an equation, and use that equation to understand the world.
- Say I want to know what makes people depressed; then theoretically we could include an infinite
number of possible variables. But this is usually not wise or feasible. Instead, we focus on
interesting parts, like the hours of sunlight or the rainfall (thinking about seasonal depression).
- If we do that, we make sure that we learn about the important features of the distribution and thus
the parts that we actually care about.
The model is simply the formula of the regression analysis.
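A minimal sketch of such a model in R, with hypothetical variable and data-frame names (not from the lecture):

    # depression, sunlight_hours, rainfall and survey_data are hypothetical names
    fit <- lm(depression ~ sunlight_hours + rainfall, data = survey_data)
    summary(fit)   # the coefficients describe the relationships among the variables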
Inference = relationships among variables
Data science cycle
- The dark grey steps in the cycle diagram are always important, and you can't skip them
Processing data basically means getting the raw data into analyzable formats.
In the data cleaning step, we need to look for illegal values (like men being pregnant), outliers, and also
missing data.
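A hedged sketch of this kind of screening in R (the data frame and column names are hypothetical):

    colSums(is.na(survey_data))                            # missing values per column
    subset(survey_data, sex == "male" & pregnant == "yes") # illegal combinations should be empty
    boxplot(survey_data$age)                               # quick visual check for outliers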
After cleaning the data, you have 3 roads.
- The typical one is EDA (exploratory data analysis): looking at distributions, checking assumptions, and so
on.
This is especially necessary with secondary data (when you did not collect the data yourself).
Modeling and testing are just the analysis. When you have the results of your analysis, you can go on to the
step of evaluating the results. This step asks how well the results answer your question. Maybe we must
improve our model, for example to improve the prediction of some outcome variable, like profit; then you go
back to modeling.
At some point we can report the findings. This could be standard scientific dissemination, like writing a report
for the government or reporting back to your boss.
Another way we can proceed is to build a data product. For example, imagine you worked on a stock-pricing
algorithm. You could deploy that into the real world, which would affect people, policy makers, and so on.
Operationalizing research questions
Operationalizing the research question might seem trivial, but it is probably the step that gets messed up the
most. Don't just assume any meaning.
When presented with a research question you must:
1. Make sure you understand exactly what is being asked.
Don't ever assume someone's meaning; ask for clarification!
Explain the research question back to the asker
2. Convert each ambiguous aspect of the question into something rigorous and quantifiable
Keep an open mind to alternative operationalizations
Consider how different operationalizations will impact the complexity of the modeling and data as
well as the quality of your results
3. If possible, code the research question into a set of hypotheses.
Analyses with a priori hypotheses will provide stronger answers to the original research questions
than analyses without such hypotheses will.
Once you have a well-operationalized research question you need to convert that question into some type of
model or test.
- Is your problem supervised or unsupervised?
Supervised means that you know what the outcome variable is.
Unsupervised means that you are just trying to understand patterns in the data.
- Is your question inference-related or prediction-related?
Are you constrained by extrinsic limitations?
- Characteristics of your audience
Your audience needs to understand the result of your analysis
- Ethical issues or security concerns
Would it be ethical to conduct an experiment?
Are you allowed to talk about the analyses to external parties?
- Limited technology, expertise, or other resources
Do you have access to a supercomputer?
- Deadlines
Do you choose an acceptable analysis that takes 10 days over an analysis that takes a year?
Exploratory data analysis
Exploratory data analysis (EDA) is a way to interactively analyze/explore your data.
- More of a mindset than a specific set of techniques or steps
The main idea is exploring: you are not trying to test hypotheses; it is rather a data-driven
approach.
- Often contrasted with strict empiricist hypothesis testing
- Very useful (necessary) when faced with strange new data
In EDA, we use a diverse selection of tools to understand what’s happening in our data:
- Statistical graphics: histogram, boxplot, scatterplot, trace plots
Can easily be used to investigate the relations in data
- Summary statistics: measures of central tendency (median, mode, etc.), measures of dispersion,
other statistics, counts and cross tabulations
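A short sketch of these tools in R, using the built-in mtcars data as a stand-in for whatever data you are exploring:

    summary(mtcars)                     # central tendency and dispersion per variable
    hist(mtcars$mpg)                    # distribution of a single variable
    boxplot(mpg ~ cyl, data = mtcars)   # compare groups and spot outliers
    plot(mtcars$wt, mtcars$mpg)         # scatterplot of a bivariate relation
    table(mtcars$cyl, mtcars$am)        # cross tabulation of two discrete variables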
An equally important aspect of EDA is data screening/cleaning
- Missing data
- Outliers
- Invalid values
When you start out, it might sometimes seem useless to do EDA in cases where you know exactly what you
want to do. But even then, it is useful.
We can't simply rely on the model fit to tell us that something is a valid model. Therefore, EDA by means of
plotting the distributions can be very useful, and of course also checking diagnostics with regard to outliers.
For example, with the data plots you see below, the statistical test will just treat these as if they were linear
relationships, and the output you get in R gives no numeric information telling us that the model is actually
wrong and that we should include curvilinear relationships.
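A classic, self-contained illustration of this point is Anscombe's quartet, which ships with R: several small data sets with nearly identical regression output, one of which is clearly curvilinear.

    # Nearly identical fits, very different data; only plotting reveals the difference
    coef(lm(y1 ~ x1, data = anscombe))
    coef(lm(y2 ~ x2, data = anscombe))   # y2 is actually curvilinear in x2
    plot(anscombe$x2, anscombe$y2)       # the scatterplot makes this obvious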
When the data are well understood we can proceed directly to CDA (confirmatory data analysis)
If we don’t care about testing hypotheses we can focus on EDA
EDA can be used to generate hypotheses for CDA.
- However, hypotheses must be generated and tested on separate data.
- It may happen that we don’t immediately have a set of hypotheses that we want to test, but maybe
we want to do some exploration and generate hypotheses and then confirm them using CDA.