Summary

Summary: Statistics and Methodology

Course
Statistics & Methodology (880259M6)

Institution
Tilburg University (UVT)

Summary including all the effects in R and images of examples.

Uploaded on March 27, 2021
Number of pages 108
Written in 2020/2021
Type Summary

multi linear regression
linear regression
bootstrapping
dummy codes
weighted effect codes
unweighted effect codes
covariance
variance
t statistics
data cleaning
mcar
missing data
outliers
normal distribution
prediction
missing data imputation
categorical

Education
Data Science & Society

robinvanheesch1

Summary: Statistics and Methodology.

Week 1.

Video 1. Basics 1.

Statistical reasoning is thinking carefully about conclusions and precise measurements in our tests.

Data scientists must scrutinize large amounts of data and extract useful knowledge. Data contains
raw information; to convert this information into actionable knowledge, data scientists apply various
data analytic techniques. When presenting the results of such analyses, data scientists must be careful
not to overstate their findings. Too much confidence in an uncertain finding could lead your employer to
waste large amounts of resources chasing data anomalies. Statistics offers us a way to protect ourselves
from ourselves.

Probability distributions quantify how likely it is to observe each possible value of some probabilistic
entity. Probability distributions are re-scaled frequency distributions. We can build up the intuition of
a probability density by beginning with a histogram (density = proportion). With an infinite number of
bins, a histogram smooths into a continuous curve.

• In a loose sense, the height of the curve at each point indicates how likely the corresponding
X value is to be observed in any given sample.
• The area under the curve (AUC) must equal 1.0.
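The step from histogram to density can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated draws from a standard normal distribution, and `histogram_density` is a hypothetical helper, not something from the course material. Dividing each bin's proportion by the bin width rescales the histogram so that its total area is 1, whatever the number of bins.

```python
import random

random.seed(1)
# Hypothetical data: 10,000 simulated draws from a standard normal distribution.
sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def histogram_density(values, n_bins):
    """Return (bin_width, densities) for an equal-width histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Density = proportion of observations in the bin, divided by the bin width.
    densities = [c / (len(values) * width) for c in counts]
    return width, densities

for n_bins in (10, 50, 200):
    width, densities = histogram_density(sample, n_bins)
    area = sum(d * width for d in densities)
    print(n_bins, round(area, 6))
```

With more bins the bars become narrower and the outline gets closer to a smooth curve, while the total area stays at 1.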

Video 2. Basics 2.

Statistical testing: in practice, we may want to distill the information in the preceding plot into a
simple statistic so we can make a judgement. One way to distill this information and control for
uncertainty when generating knowledge is through statistical testing. When we conduct statistical
tests, we weight the estimated effect by the precision of the estimate. A common type of statistical
test, the Wald test (of which the t-test is an example), follows this pattern:
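In its standard textbook form, a Wald-type statistic divides the distance between the estimate and its null-hypothesized value by the standard error of the estimate. The symbols below are assumptions chosen for illustration: \hat{\theta} is the estimate, \theta_0 the value under the null hypothesis, and SE the standard error.

```latex
T = \frac{\hat{\theta} - \theta_0}{\mathrm{SE}(\hat{\theta})}
```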

If we want to test the null hypothesis of a zero mean difference, applying Wald test logic to control
for the uncertainty in our estimate results in the familiar t-test:
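Written out in its standard form, with symbols that are assumptions chosen for illustration (\bar{X}_A and \bar{X}_B are the two sample means, 0 is the null-hypothesized mean difference, and SE is the standard error of the mean difference):

```latex
t = \frac{(\bar{X}_A - \bar{X}_B) - 0}{\mathrm{SE}(\bar{X}_A - \bar{X}_B)}
```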

(Don’t memorize formulas!)

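A larger test statistic (in absolute value) means the estimated effect is large relative to its
uncertainty, which gives us more certainty that the effect is real.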
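As a minimal sketch of "effect weighted by precision", the snippet below computes a t-statistic by hand. The lap times are invented for illustration, `t_statistic` is a hypothetical helper, and the unpooled (Welch) standard error is an assumption, since the notes do not say which variant the course uses.

```python
import math
import statistics

# Hypothetical lap times in seconds (invented for illustration).
laps_a = [118.2, 119.5, 117.9, 120.1, 118.8, 119.0]
laps_b = [117.1, 118.0, 116.5, 118.9, 117.4, 117.8]

def t_statistic(x, y, null_value=0.0):
    """Wald-type statistic: (estimate - null value) / standard error."""
    estimate = statistics.mean(x) - statistics.mean(y)
    # Unpooled (Welch) standard error of the mean difference.
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (estimate - null_value) / se

t_hat = t_statistic(laps_a, laps_b)
print(round(t_hat, 2))
```

The same mean difference gives a larger statistic when the lap times within each setup vary less, because the standard error in the denominator shrinks.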

How do we use a test statistic to compare, for example, lap times?

• A test statistic, by itself, is just an arbitrary number.
• To conduct the test, we need to compare the test statistic to some objective reference.
• This objective reference needs to tell us something about how exceptional our test statistic
is.
• The specific reference we will be employing is known as the sampling distribution of the test
statistic.

A sampling distribution is simply the probability distribution of a statistic (for example, an
estimate of a parameter).

• The population is defined by an infinite sequence of repeated tests. The sampling distribution
quantifies the possible values of the test statistic over infinite repeated sampling.
• The area of a region under the curve represents the probability of observing a test statistic
within the corresponding interval.

Note that a sampling distribution is a slightly different concept than the distribution of a random
variable:

• The sampling distribution quantifies the possible values of a statistic (mean, t-statistic,
correlation coefficient, etc.).
• The distribution of a random variable quantifies the possible values of a variable (age,
gender, income, movie preference, etc.).
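The difference between the two kinds of distribution can be made concrete with a small simulation. Everything here is an assumption for illustration: the "population" of ages is simulated from a uniform distribution, and `sample_means` is a hypothetical helper that repeatedly draws samples and records the mean of each.

```python
import random
import statistics

random.seed(2)
# Hypothetical population of ages: a variable with its own distribution.
population = [random.uniform(18, 80) for _ in range(50_000)]

def sample_means(pop, n, reps):
    """Draw `reps` samples of size `n` and return the mean of each one."""
    return [statistics.mean(random.sample(pop, n)) for _ in range(reps)]

# Approximate sampling distribution of the mean for samples of 50 people.
means = sample_means(population, n=50, reps=2_000)

sd_variable = statistics.pstdev(population)  # spread of the variable (age)
sd_statistic = statistics.stdev(means)       # spread of the statistic (mean age)
print(round(sd_variable, 2), round(sd_statistic, 2))
```

The statistic varies much less than the variable: the spread of the sample means is roughly the spread of the ages divided by the square root of the sample size.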

The t-test we’ve been considering is a way to summarize the comparison of the distributions of
two variables.

• The t-statistic also has a sampling distribution that quantifies the possible t-values we could
get if we repeatedly drew samples from the variables’ distributions and re-computed the
t-statistic each time.

To quantify how exceptional our estimated t-statistic is, we compare the estimated value to a sampling
distribution of t-statistics assuming no effect. This distribution quantifies H0. The special case of an
H0 of no effect is called the nil-null. If our estimated statistic would be very unusual in a population
where H0 is true, we reject the null hypothesis and claim a ‘statistically significant’ effect.

We can find the probability associated with a range of values by computing the area of the
corresponding slice from the distribution.

By calculating the area in the null distribution that exceeds our estimated test statistic, we can
compute the probability of observing the given test statistic, or one more extreme, if H0 were
true. In other words, we can compute the probability of having sampled the data we observed, or
more unusual data, from a population wherein there is no true mean difference in lap times. This
value is the infamous p-value.
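The area-beyond-the-statistic idea can be approximated by simulation instead of a formula. The sketch below is not the course's procedure: it assumes 50 observations per group, standard normal data, an unpooled standard error, and an illustrative observed value of 1.86. It builds a null distribution by repeatedly sampling two groups from the same population and then takes the share of simulated t-statistics at or beyond the observed one.

```python
import math
import random
import statistics

random.seed(3)

def t_statistic(x, y):
    """Mean difference divided by its (unpooled) standard error."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

def simulate_null_t(n_per_group, reps):
    """t-statistics from repeated samples when H0 (no mean difference) is true."""
    stats = []
    for _ in range(reps):
        # Both groups come from the same distribution, so H0 holds by construction.
        x = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        y = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        stats.append(t_statistic(x, y))
    return stats

null_ts = simulate_null_t(n_per_group=50, reps=5_000)
t_observed = 1.86  # assumed observed value, for illustration only

# One-tailed p-value: share of the null distribution at or beyond the observed t.
p_one_tailed = sum(t >= t_observed for t in null_ts) / len(null_ts)
print(p_one_tailed)
```

Because the null distribution is simulated, the result is an approximation that changes slightly with the seed and the number of repetitions.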

The preceding test is one-tailed. We use a one-tailed test when we have directional hypotheses. Since
we had no prior expectation that setup B would outperform setup A, we need to use a two-tailed test.
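A rough sketch of the difference between the two kinds of test, under a loudly stated assumption: the null distribution is approximated by a standard normal (reasonable only for large samples), so the numbers will differ slightly from those of an exact t-distribution. The observed value of 1.86 is used purely for illustration.

```python
from statistics import NormalDist

# Large-sample approximation: treat the null distribution as standard normal.
null_dist = NormalDist(mu=0.0, sigma=1.0)
t_observed = 1.86  # assumed value, for illustration only

# One-tailed: only large positive values count as evidence against H0.
p_upper = 1.0 - null_dist.cdf(t_observed)
# Two-tailed: values that are extreme in either direction count.
p_two_tailed = 2.0 * (1.0 - null_dist.cdf(abs(t_observed)))

print(round(p_upper, 3), round(p_two_tailed, 3))
```

For a symmetric null distribution the two-tailed p-value is twice the one-tailed one, so a result that is significant at the 0.05 level in a one-tailed test may not be significant in a two-tailed test.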

Consider the one-tailed test for our estimated test statistic of t = 1.86 that produces a p-value of
p = 0.032:

• We cannot say that there is a 0.032 probability that the true mean difference is greater than
zero.
• We cannot say that there is a 0.032 probability that Ha is true.
• We cannot say that there is a 0.032 probability that the null hypothesis is false.
• We cannot say that there is a 0.032 probability of replicating the observed effect in future
studies.

How do we actually interpret p-values? The p-value tells us P(t ≥ T | H0), where T is our estimated
test statistic. But what we really want to know is P(H0 | t ≥ T). All that we can say is that there is
a 0.032 probability of observing a test statistic at least as large as T, if H0 is true. Our test uses
the same logic as proof by contradiction.

The probability of observing any individual point on a continuous distribution is exactly zero.
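One way to see this is to compute the probability of an ever narrower interval around a point. The sketch assumes a standard normal distribution and an arbitrary point of 1.86; `interval_probability` is a hypothetical helper.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)

def interval_probability(center, half_width):
    """P(center - half_width <= X <= center + half_width), an area under the curve."""
    return dist.cdf(center + half_width) - dist.cdf(center - half_width)

# The area shrinks toward zero as the interval narrows to a single point.
for half_width in (1.0, 0.1, 0.01, 0.001):
    print(half_width, interval_probability(1.86, half_width))
```

This is why probabilities for continuous distributions are always stated for ranges of values, such as "a test statistic at least as large as T".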

Video 3. Basics 3.

Statistical testing is a very useful tool, but it quickly reaches a limit. In experimental contexts,
real-world messiness is controlled through random assignment, and statistical testing is a sufficient
method of knowledge generation. Data scientists rarely have the luxury of being able to conduct
experiments. Data scientists work with messy observational data and usually don’t have questions
that lend themselves to rigorous testing. Data scientists need statistical modeling.

The idea of statistical modeling: modelers attempt to build a mathematical representation of the
interesting aspects of a data distribution. The model succinctly describes whatever system is being
analyzed. Beginning with a model ensures that we are learning the important features of a
distribution. The modeling approach is especially important in messy data science applications
where clear a priori hypotheses are rare.

To apply a modeling approach to our example problem, we consider the combined distribution of lap
times. The model we construct will explain variation in lap times based on interesting features. In this
simple case, the only feature we consider is the type of setup.
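A minimal sketch of such a model, assuming invented lap times and a dummy-coded setup variable (0 = setup A, 1 = setup B): a simple least-squares regression of lap time on setup. `simple_ols` is a hypothetical helper written for this illustration and is not the course's own code.

```python
import statistics

# Hypothetical lap times in seconds (invented for illustration).
laps_a = [118.2, 119.5, 117.9, 120.1, 118.8, 119.0]
laps_b = [117.1, 118.0, 116.5, 118.9, 117.4, 117.8]

# Combined distribution of lap times with a dummy-coded feature: 0 = setup A, 1 = setup B.
y = laps_a + laps_b
x = [0] * len(laps_a) + [1] * len(laps_b)

def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

intercept, slope = simple_ols(x, y)
print(round(intercept, 3), round(slope, 3))
```

With a single dummy-coded feature, the intercept equals the mean lap time for setup A and the slope equals the mean difference between setup B and setup A, so the model carries the same information as the mean comparison while leaving room for additional features.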
