Statistical reasoning means thinking carefully about the conclusions we draw from our measurements and tests. Data scientists must scrutinize large amounts of data and extract useful knowledge from it. Data contains raw information; to convert this information into actionable knowledge, data scientists apply various analytic techniques and then present the results of the analysis. Data scientists must be careful not to overstate their findings. Too much confidence in an uncertain finding could lead your employer to waste large amounts of resources chasing data anomalies. Statistics offers us a way to protect ourselves from ourselves.
Probability distributions quantify how likely it is to observe each possible value of some probabilistic entity. Probability distributions can be seen as re-scaled frequency distributions. We can build up the intuition of a probability density by beginning with a histogram in which bar heights represent proportions (density = proportion). With an infinite number of bins, the histogram smooths into a continuous curve.
In a loose sense, each point on the curve gives the relative likelihood of observing the corresponding X value in any given sample.
The area under the curve (AUC) must integrate to 1.0.
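This intuition can be checked numerically; the normal "lap time" samples below are an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=90.0, scale=2.0, size=10_000)  # hypothetical lap times

# density=True re-scales bar heights so that (height * bin_width) sums to 1,
# turning the frequency histogram into an estimate of the probability density.
heights, edges = np.histogram(samples, bins=50, density=True)
bin_width = edges[1] - edges[0]

total_area = np.sum(heights * bin_width)
print(total_area)  # ~1.0: the area under the density curve integrates to 1
```

With more and narrower bins the bar tops trace out the smooth density curve, while the total area stays fixed at 1.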
Video 2. Basics 2.
Statistical testing: in practice we may want to distill the information in the preceding plot into a simple statistic so we can make a judgement. One way to distill this information and control for uncertainty when generating knowledge is through statistical testing. When we conduct statistical tests, we weight the estimated effect by the precision of the estimate. A common type of statistical test, the Wald test (of which the t-test is an instance), follows this pattern:
    test statistic = estimate / SE(estimate)
If we want to test the null hypothesis of a zero mean difference, applying Wald test logic to control for the uncertainty in our estimate results in the familiar t-test:
    t = (mean difference - 0) / SE(mean difference)
(Don't memorize the formulas!)
You want the test statistic to be large (in absolute value) to have more certainty.
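The Wald/t-test pattern above can be sketched in a few lines; the lap-time numbers are made up purely for illustration, and this sketch uses the unpooled (Welch-style) standard error:

```python
import numpy as np

# Hypothetical lap-time samples (seconds) for two car setups.
setup_a = np.array([90.1, 89.8, 91.2, 90.5, 89.9, 90.7])
setup_b = np.array([89.2, 89.5, 88.9, 90.0, 89.1, 89.4])

diff = setup_a.mean() - setup_b.mean()              # estimated effect
se = np.sqrt(setup_a.var(ddof=1) / len(setup_a)     # precision of the estimate:
             + setup_b.var(ddof=1) / len(setup_b))  # SE of the mean difference

t_stat = (diff - 0.0) / se  # Wald logic: (estimate - null value) / SE(estimate)
print(t_stat)
```

The same estimated difference yields a larger t when the standard error is small, which is exactly the "weight the effect by its precision" idea.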
How do we use a test statistic to compare, for example, lap times?
A test statistic by itself is just an arbitrary number.
To conduct the test, we need to compare the test statistic to some objective reference.
This objective reference needs to tell us something about how exceptional our test statistic
is.
The specific reference we will be employing is known as a sampling distribution of the test
statistic.
A sampling distribution is simply the probability distribution of a statistic over repeated sampling.
The population is defined by an infinite sequence of repeated tests. The sampling distribution
quantifies the possible values of the test statistic over infinite repeated sampling.
The area of a region under the curve represents the probability of observing a test statistic
within the corresponding interval.
Note that a sampling distribution is a slightly different concept than the distribution of a random
variable:
The sampling distribution quantifies the possible values of a statistic (mean, t-stat,
correlation coefficient, etc.).
The distribution of a random variable quantifies the possible values of a variable (age,
gender, income, movie preference, etc.).
The t-test we’ve been considering is a way to summarize the comparison of two variable
distributions.
The t-statistic also has a sampling distribution, which quantifies the possible t-values we could get
if we repeatedly drew samples from the variables' distributions and re-computed the t-statistic
each time.
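The repeated-sampling idea can be simulated directly; the normal lap-time populations and sample sizes below are assumptions chosen just to illustrate the concept:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 5_000

# Under H0 both setups share the same lap-time distribution, so both samples
# come from one population; any nonzero t is pure sampling noise.
t_stats = np.empty(reps)
for i in range(reps):
    a = rng.normal(90.0, 2.0, n)
    b = rng.normal(90.0, 2.0, n)
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    t_stats[i] = (a.mean() - b.mean()) / se

# The collection of t-stats approximates the sampling distribution under H0:
print(t_stats.mean(), t_stats.std())  # centre near 0, spread near 1
```

A histogram of `t_stats` is exactly the null reference distribution the test compares against.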
To quantify how exceptional our estimated t-statistic is, we compare the estimated value to a sampling
distribution of t-statistics computed assuming no effect. This distribution quantifies H0; the special case of an H0 of
no effect is called the nil-null. If our estimated statistic would be very unusual in a population where
the H0 is true, we reject the null and claim a 'statistically significant' effect.
We can find the probability associated with a range of values by computing the area of the
corresponding slice of the distribution.
By calculating the area in the null distribution that exceeds our estimated test statistic, we can
compute the probability of observing the given test statistic, or one more extreme, if the H0 were
true. In other words, we can compute the probability of having sampled the data we observed, or
more unusual data, from a population wherein there is no true mean difference in lap times. This
value is the infamous p-value.
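As a sketch of this computation: assuming (for illustration) that the null distribution of t is approximately standard normal, the lecture's t = 1.86 gives the one-tailed tail area directly with Python's standard library:

```python
from statistics import NormalDist

t_obs = 1.86  # estimated test statistic from the example

# One-tailed p-value: area of the (approximately normal) null distribution
# beyond t_obs, i.e. P(T >= t_obs | H0).
p_one = 1 - NormalDist().cdf(t_obs)

# A two-tailed test counts both tails as "at least as extreme":
p_two = 2 * p_one

print(round(p_one, 3), round(p_two, 3))  # ~0.031 and ~0.063
```

With a finite-sample t reference distribution the exact areas shift slightly, but the logic of "area beyond the observed statistic" is the same.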
The preceding test is one-tailed; we use a one-tailed test when we have a directional hypothesis. Since
we did not specifically expect setup B to outperform setup A (or vice versa), we need to use a two-tailed test.
Consider the one-tailed test for our estimated test statistic of t = 1.86 that produces a p-value of p =
0.032:
We cannot say that there is a 0.032 probability that the true mean difference is greater than
zero.
We cannot say that there is a 0.032 probability that the Ha is true.
We cannot say that there is a 0.032 probability that the Null hypothesis is false.
We cannot say that there is a 0.032 probability of replicating the observed effect in future
studies.
How do we actually interpret p-values? The p-value tells us the probability of the data given the null hypothesis, P(data | H0). But what we really want to
know is the probability of the hypothesis given the data, P(H0 | data). All that we can say is that there is a 0.032 probability of observing a test
statistic at least as large as T, if H0 is true. Our test uses the same logic as proof by contradiction.
The probability of observing any individual point on a continuous distribution is exactly zero.
Video 3. Basics 3.
Statistical testing is a very useful tool, but it quickly reaches its limits. In experimental contexts, real-
world messiness is controlled through random assignment, and statistical testing is a sufficient
method of knowledge generation. Data scientists, however, rarely have the luxury of being able to conduct
experiments. Data scientists work with messy observational data and usually don't have questions
that lend themselves to rigorous testing. Data scientists need statistical modeling.
The idea of statistical modeling: modelers attempt to build a mathematical representation of the
interesting aspects of a data distribution. The model succinctly describes whatever system is being
analyzed. Beginning with a model ensures that we are learning the important features of a
distribution. The modeling approach is especially important in messy data science applications
where clear a priori hypotheses are rare.
To apply a modeling approach to our example problem, we consider the combined distribution of lap
times. The model we construct will explain variation in lap times based on interesting features. In this
simple case the only feature we consider is the type of setup.
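As a minimal sketch of this model, assuming hypothetical lap times and a 0/1 dummy variable for setup, ordinary least squares recovers the group means, and the slope on the dummy is exactly the mean difference the t-test examined:

```python
import numpy as np

# Hypothetical lap times (seconds); `setup` is a 0/1 dummy (0 = A, 1 = B).
lap_time = np.array([90.1, 89.8, 91.2, 90.5, 89.2, 89.5, 88.9, 90.0])
setup = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Model: lap_time = b0 + b1 * setup + error.
# b0 estimates the mean lap time for setup A; b1 estimates the B-minus-A
# mean difference -- the same quantity the t-test compared to zero.
X = np.column_stack([np.ones_like(setup, dtype=float), setup])
beta, *_ = np.linalg.lstsq(X, lap_time, rcond=None)
print(beta)  # [mean of A, mean(B) - mean(A)]
```

Framing the comparison as a model makes it easy to add further features later (tyres, weather, driver), which is where modeling outgrows the plain t-test.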
Summary sold via Stuvia by seller robinvanheesch1.