Summary: Statistics and Methodology (MSc Data Science and Society)

Probability distributions
A probability distribution is a mathematical function that describes all the possible values of a
probabilistic entity.
- Probability distributions quantify how likely it is to observe each possible value of some probabilistic
entity (for example, height)
- Probability distributions are re-scaled frequency distributions.
 We could measure the heights of all women in the Netherlands, put them into a histogram,
and then look at how frequent a height of 1.70 m is.

With an infinite number of bins, a histogram smooths into a continuous curve.
In a loose sense, each point on the curve gives the probability of observing the corresponding X
value in any given sample.
- We say "in a loose sense" because, for a continuous distribution, you can only talk about the area
under the curve, not about the probability of a single point.
The one special characteristic of a probability distribution, compared to other histograms or
distributions, is that the area under the curve must integrate to 1.0.
- The reason is that the total probability of all possible events lies under the entire curve.
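As a quick numerical check of this property, here is a minimal R sketch (using the standard normal density
dnorm as a stand-in for a height distribution; the choice of distribution is mine, not from the course):

# The area under a probability density must integrate to 1.0.
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)
print(total_area)  # 1 with absolute error < 9.4e-05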

Suppose I tell you that the mean lap time for setup A = 118 seconds
The mean lap time for setup B = 110 seconds
First scenario:
- The standard deviation for the times under setup A = 7 seconds and the standard deviation for the
times under setup B = 5 seconds
Second scenario:
- The standard deviation of times under setup A = 35 seconds and the standard deviation under setup B
= 25 seconds
For the first scenario you can conclude that you can be much more confident recommending setup B
(with an SD of 5), because here the average lap times are measured with much greater precision.
In the second scenario we might not be able to differentiate between the two setups: the means might
be different, but the individual scores overlap quite a bit. Therefore, we have to consider not only the
location (the mean) of the two distributions of lap times, but also their variability.

We will gain insight by conceptualizing our example problem in terms of the underlying distributions of lap
times.

[Figure: density curves of the lap times, one panel per scenario.]

The left panel shows the first scenario, with a standard deviation of 5 for setup B in blue and a standard
deviation of 7 for setup A in red. The right panel shows the second scenario. This gives a clear image of what is
meant by the scores overlapping a lot in the second scenario.
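A minimal R sketch that reproduces plots like these from the numbers above, assuming normally distributed
lap times (the normality assumption and the plotting ranges are mine, not from the course):

# Density curves built from the stated means and standard deviations.
x1 <- seq(85, 145, length.out = 400)   # range for scenario 1
x2 <- seq(0, 240, length.out = 400)    # range for scenario 2

par(mfrow = c(1, 2))

# Scenario 1: SD(B) = 5 (blue), SD(A) = 7 (red) -> little overlap.
plot(x1, dnorm(x1, mean = 110, sd = 5), type = "l", col = "blue",
     xlab = "Lap time (s)", ylab = "Density", main = "Scenario 1")
lines(x1, dnorm(x1, mean = 118, sd = 7), col = "red")

# Scenario 2: SD(B) = 25 (blue), SD(A) = 35 (red) -> heavy overlap.
plot(x2, dnorm(x2, mean = 110, sd = 25), type = "l", col = "blue",
     xlab = "Lap time (s)", ylab = "Density", main = "Scenario 2")
lines(x2, dnorm(x2, mean = 118, sd = 35), col = "red")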


Statistical testing
The probability distributions give us a good idea of how homogeneous/heterogeneous populations can be. But
from these plots of probability distributions alone it is difficult to make judgements. We want an objective
statistic, and this is where statistical testing comes in. We need a measure that takes the variability into account.
- A common statistical test is the t-test (a Wald-type test).

T-statistic in R:
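The original code screenshot is missing here, so the following is only a reconstruction of how such a
t-statistic could be computed in R; the sample sizes, seed, and use of t.test() are my assumptions:

# Simulated lap times for both setups (first-scenario numbers).
set.seed(1)
times_A <- rnorm(25, mean = 118, sd = 7)
times_B <- rnorm(25, mean = 110, sd = 5)

# Two-sample t-test: t = (difference in means) / (its standard error).
out <- t.test(times_A, times_B)
out$statistic  # the estimated t-statistic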




A test statistic, by itself, is just an arbitrary number.
- To conduct the test, we need to compare the test statistic to some objective reference
- This objective reference needs to tell us something about how exceptional our test statistic is.
The specific reference that we’re going to use is a so-called sampling distribution of our test statistic.


Sampling distribution
A sampling distribution is simply the probability distribution of a statistic.
- The sampling distribution quantifies the possible values of the test statistic over infinite repeated
sampling.
 So, what we want to do is think about grabbing a sample from the population of lap times,
calculating the mean difference and the t-statistic; that would be one point that goes into
making up this curve, the sampling distribution. We do this again and get another point, and so on.
The area of a region under the curve represents the probability of observing a test statistic within the
corresponding interval.
- If the value falls in the tails, then it is an improbable value.
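This repeated-sampling idea is easy to mimic by simulation. A sketch in R (the population parameters,
sample sizes, and number of replications are assumptions for illustration):

# Approximate the sampling distribution of the t-statistic by
# repeatedly drawing samples and recomputing the statistic.
set.seed(2)
t_stats <- replicate(5000, {
  a <- rnorm(25, mean = 118, sd = 7)   # a sample from setup A
  b <- rnorm(25, mean = 110, sd = 5)   # a sample from setup B
  t.test(a, b)$statistic               # one point on the curve
})

hist(t_stats, breaks = 50, xlab = "t-statistic",
     main = "Sampling distribution of t")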

Sampling distribution  quantifies the possible values of a statistic (mean, t-statistic, correlation, etc.)
Distribution of a random variable  quantifies the possible values of a variable (age, gender, income, etc.)

To quantify how exceptional our estimated t-statistic is, we compare the estimated value to a sampling
distribution of t-statistics assuming no effect.
- This distribution quantifies the null hypothesis
 The special case of a null hypothesis of no effect is called the nil-null


Interpreting P-values
What the p-value does tell us: the probability of observing a test statistic at least as large as our estimated
test statistic, given that the null hypothesis is true.
- All that we can say is that there is a 0.032 probability of observing a test statistic at least as large as t̂
(the estimated test statistic), if the null hypothesis is true.
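In R this tail area is a single function call. A sketch, where the value of the estimated t-statistic and the
degrees of freedom are assumptions chosen so the result roughly matches the 0.032 above:

# Upper-tail area of the null t-distribution at the estimate.
t_hat <- 1.9                                           # hypothetical estimate
p_one_sided <- pt(t_hat, df = 48, lower.tail = FALSE)  # ~0.032

# Two-sided version: at least as extreme in either direction.
p_two_sided <- 2 * pt(abs(t_hat), df = 48, lower.tail = FALSE)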


Statistical modeling
Data scientists rarely have the luxury of being able to conduct experiments and thus to control for confounding
factors. When working with observational data we usually don't have randomly assigned groups, and this
makes the groups potentially incomparable. So, statistical testing as a stand-alone tool is only useful in
experimental contexts. Since we also need to be able to control for confounding variables in observational
data, we need statistical modeling.

Modelers attempt to build a mathematical representation of the interesting aspects of a data distribution.
Beginning with a model ensures that we are learning the important features of a distribution. We describe those
features in terms of variables, put them into an equation, and use them to understand the world.
- Say I want to know what makes people depressed. Theoretically we could include an infinite
number of possible variables, but this is usually not wise or feasible. Instead, we focus on
interesting parts, like the hours of sunlight or the amount of rain (thinking about seasonal depression).
- If we do that, we make sure that we learn about the important features of a distribution and thus
the parts that we actually care about.
The model is simply the formula of the regression analysis.

Inference = relationships among variables
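As a concrete (entirely hypothetical) version of the depression example, the model is just a regression
formula; all variable names and data below are made up for illustration:

# Made-up data: depression score vs. hours of sunlight and rainfall.
set.seed(5)
survey_data <- data.frame(
  sunlight_hours = runif(200, 0, 12),
  rainfall       = runif(200, 0, 30)
)
survey_data$depression <- 10 - 0.4 * survey_data$sunlight_hours +
  0.1 * survey_data$rainfall + rnorm(200)

# The model: an equation relating the variables we care about.
fit <- lm(depression ~ sunlight_hours + rainfall, data = survey_data)
summary(fit)  # the coefficients describe relationships among variables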


Data science cycle
[Figure: diagram of the data science cycle.]

- Dark grey steps in the diagram are always important; you can't skip them.

Processing data basically means getting the raw data into an analyzable format.
In the data cleaning step, we need to look for illegal values (like men being pregnant), outliers, and
missing data.
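A few typical checks from this cleaning step, sketched in R on a made-up data frame (all names and
values are fabricated for illustration):

# Fabricated data with the three kinds of problems.
set.seed(6)
dat <- data.frame(
  sex      = sample(c("male", "female"), 100, replace = TRUE),
  pregnant = sample(c(TRUE, FALSE), 100, replace = TRUE),
  age      = c(rnorm(98, mean = 40, sd = 10), 130, NA)
)

# Illegal values: e.g., pregnant men.
dat[dat$sex == "male" & dat$pregnant, ]

# Outliers: ages more than 3 SDs from the mean.
z <- (dat$age - mean(dat$age, na.rm = TRUE)) / sd(dat$age, na.rm = TRUE)
dat[!is.na(z) & abs(z) > 3, ]

# Missing data: count NAs per column.
colSums(is.na(dat))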
After cleaning the data, you have three roads.
- The typical one is EDA  exploratory data analysis: looking at distributions, checking assumptions, and so
on.
 Especially necessary with secondary data (when you did not collect the data yourself).
Modeling and testing are just the analysis. When you have the results of your analysis, you can go on to the step
of evaluating results. This step asks how well the results answer your question. Maybe we need to improve our
model, for example to improve the prediction of some outcome variable, like profit; then you go back to
modeling.
At some point we can report the findings. This could be standard scientific dissemination, writing
a report for a government, or reporting back to your boss.
Another way we can proceed is to build a data product. For example, imagine you worked on a stock-pricing
algorithm. Then you could deploy that into the real world, which would affect people, policy makers, and so on.

Operationalizing research questions
Operationalizing the research question might seem trivial, but it is probably the step that gets messed up
most often. Don't just assume any meaning.
When presented with a research question you must:
1. Make sure you understand exactly what is being asked.
 Don’t ever assume someone’s meaning  ask for clarification!
 Explain the research question back to the asker
2. Convert each ambiguous aspect of the question into something rigorous and quantifiable
 Keep an open mind to alternative operationalizations
 Consider how different operationalizations will impact the complexity of the modeling and data as
well as the quality of your results
3. If possible, code the research question into a set of hypotheses.
 Analyses with a priori hypotheses will provide stronger answers to the original research questions
than analyses without a priori hypotheses will.

Once you have a well-operationalized research question you need to convert that question into some type of
model or test.

- Is your problem supervised or unsupervised? (See the sketch after this list.)
 Supervised means that you know what the outcome variable is.
 Unsupervised means that you are just trying to understand patterns in the data.
- Is your question inference-related or prediction-related?
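The supervised/unsupervised distinction is easy to see in code. A minimal R sketch on the built-in iris
data (the choice of outcome variable and of three clusters is arbitrary):

# Supervised: the outcome variable (here Sepal.Length) is known.
sup_fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

# Unsupervised: no outcome variable; we only look for patterns,
# here three clusters in the four numeric measurements.
unsup_fit <- kmeans(iris[, 1:4], centers = 3)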

Are you constrained by extrinsic limitations?
- Characteristics of your audience
 Your audience needs to understand the result of your analysis
- Ethical issues or security concerns
 Would it be ethical to conduct an experiment?
 Are you allowed to talk about the analyses to external parties?
- Limited technology, expertise, or other resources
 Do you have access to a supercomputer?
- Deadlines
 Do you choose an acceptable analysis that takes 10 days over an analysis that takes a year?

Exploratory data analysis
Exploratory data analysis (EDA) is a way to interactively analyze/explore your data.
- More of a mindset than a specific set of techniques or steps
 The main idea is exploring. You are not trying to test hypotheses; it is a data-driven
approach.
- Often contrasted with strict confirmatory hypothesis testing
- Very useful (even necessary) when faced with strange new data

In EDA, we use a diverse selection of tools to understand what's happening in our data (a short sketch on
built-in data follows this list):
- Statistical graphics  histograms, boxplots, scatterplots, trace plots
 These can easily be used to investigate the relations in data
- Summary statistics  measures of central tendency (mean, median, mode), measures of dispersion,
other statistics, counts and cross tabulations
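A few of these tools applied to R's built-in mtcars data, as a quick illustration:

hist(mtcars$mpg)                    # distribution of one variable
boxplot(mpg ~ cyl, data = mtcars)   # dispersion across groups
plot(mtcars$wt, mtcars$mpg)         # scatterplot of a relation
summary(mtcars$mpg)                 # central tendency and spread
table(mtcars$cyl, mtcars$gear)      # cross tabulation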

An equally important aspect of EDA is data screening/cleaning
- Missing data
- Outliers
- Invalid values

When you start out, it might sometimes seem useless to do EDA in cases where you know exactly what you
want to do. But even then, it is useful.
We can't simply rely on the model fit to tell us that something is a valid model. Therefore, EDA by means of
plotting the distributions can be very useful, as is checking diagnostics with regard to outliers.
For example, with data plots like the ones described below, the statistical test will simply treat the
relationships as if they were linear, and the numeric output you get in R gives no indication that the model is
actually wrong and that we should include curvilinear relationships.

[Figure: scatterplots of clearly curvilinear relationships that a linear fit would misrepresent.]
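A minimal simulated example of this trap in R (the data are fabricated; any curved relationship would do):

# Clearly curvilinear data.
set.seed(3)
x <- runif(100, -3, 3)
y <- x^2 + rnorm(100, sd = 0.5)

# The linear fit runs without complaint; the numeric output alone
# gives no hint that a straight line is the wrong model.
fit <- lm(y ~ x)
summary(fit)

# A simple plot immediately reveals the curvilinear relationship.
plot(x, y)
abline(fit, col = "red")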




When the data are well-understood  we can proceed directly to CDA (confirmatory data analysis)
If we don't care about testing hypotheses  we can focus on EDA
EDA can be used to generate hypotheses for CDA.
- However, hypotheses must be generated and tested on separate data (see the sketch after this list).
- It may happen that we don't immediately have a set of hypotheses that we want to test; maybe
we want to do some exploration first, generate hypotheses, and then confirm them using CDA.
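A common way to honor the separate-data rule is a simple random split, sketched in R (the data frame
and the 50/50 split are arbitrary choices made for illustration):

# Split the data into an exploration half and a confirmation half.
set.seed(7)
dat <- data.frame(x = rnorm(100), y = rnorm(100))  # stand-in data

idx <- sample(nrow(dat), size = floor(0.5 * nrow(dat)))
explore_set <- dat[idx, ]   # generate hypotheses here (EDA)
confirm_set <- dat[-idx, ]  # test them here (CDA)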
