Statistics & Methodology 2017/2018 - Summary

  • January 10, 2018
  • 24
  • 2017/2018
  • Summary
Statistics & Methodology
summary
In general
- Purpose of statistics: systematize the way we account for uncertainty when making data-based decisions.
- High variance (high standard deviation) » do not draw conclusions based on the mean difference (Mdif) alone
- Data Scientist: raw information » data analytic techniques » actionable knowledge
- Do not overstate findings when presenting results » could lead to a waste of time/money

Probability Distributions
- PDs quantify how likely each possible value of some probabilistic entity is
- PDs are re-scaled frequency distributions
- Large population » histogram turns into a continuous ‘smooth’ curve (total area below: 1.0)
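The idea of a probability distribution as a re-scaled frequency distribution can be sketched in a few lines. This is only an illustration: the ratings data and the helper name `relative_frequencies` are made up for the example and do not come from the course material.

```python
from collections import Counter

def relative_frequencies(values):
    """Rescale raw frequency counts so that they sum to 1.0."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in sorted(counts.items())}

# Hypothetical ratings on a 1-5 scale (illustrative data only).
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
probabilities = relative_frequencies(ratings)
print(probabilities)
print(sum(probabilities.values()))
```

Dividing each count by the total is what turns a histogram of counts into an (empirical) probability distribution whose total mass is 1.0.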

Statistical Testing
- Distil information and control for uncertainty; weigh estimated effect by its precision
- Common type of statistical test, the Wald test: T = Estimate / Variability
- Need to compare the test statistic to some objective reference to conduct the test
- This objective reference, the sampling distribution, tells us how exceptional our test statistic is.
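As a rough sketch of T = Estimate / Variability, the example below takes the estimate to be a difference in group means and the variability to be its standard error. Both choices are assumptions made for illustration (the notes do not fix them), and the data and the function name `wald_statistic` are hypothetical.

```python
import math
import statistics

def wald_statistic(group_a, group_b):
    """T = estimate / variability for a difference in means."""
    # Estimate: difference between the two sample means.
    estimate = statistics.mean(group_a) - statistics.mean(group_b)
    # Variability: standard error of that difference (unequal-variance form).
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return estimate / standard_error

# Hypothetical ratings for two groups (illustrative data only).
group_a = [7, 8, 6, 9, 7, 8]
group_b = [5, 6, 7, 5, 6, 4]
t_value = wald_statistic(group_a, group_b)
print(round(t_value, 2))
```

The same mean difference gives a smaller T when the groups are noisier, which is what "weigh estimated effect by its precision" amounts to.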

Sampling Distribution
- SD is simply the probability distribution of a statistic (an estimate of a parameter)
o Population is defined by an infinite sequence of repeated tests
o SD quantifies the possible values of the test statistic over infinite repeated sampling
o Each point on the curve represents the probability density of the corresponding test statistic value
- Sampling distribution ≠ random variable distribution
o SD: quantifies possible values of a statistic (mean, t-statistic, correlation coefficient)
o RVD: quantifies possible values of a variable (age, gender, income, food type)
o SD of T-statistic: draw samples repeatedly from RVD, re-compute T each time
- How exceptional is our estimated t-statistic?
o Compare its value to the SD of the t-statistic assuming no effect (null hypothesis)
o When the estimated statistic would be very unusual in a population where the null hypothesis is true, we reject the null and claim a ‘statistically significant’ effect.
- Computing the probability of events
o Area of corresponding slice from the distribution
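The recipe "draw samples repeatedly from the RVD, re-compute T each time" can be approximated by simulation. The sketch below assumes a standard normal variable, two groups of 20, and 2000 repetitions. All of these are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math
import random
import statistics

def wald_statistic(group_a, group_b):
    """T = estimate / variability for a difference in means."""
    estimate = statistics.mean(group_a) - statistics.mean(group_b)
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return estimate / standard_error

def simulate_null_distribution(n_per_group=20, n_reps=2000, seed=1):
    """Re-compute T on repeated samples from a population with no true effect."""
    rng = random.Random(seed)
    t_values = []
    for _ in range(n_reps):
        # Both groups come from the same distribution, so the null is true.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        t_values.append(wald_statistic(group_a, group_b))
    return t_values

null_t = simulate_null_distribution()
# Share of simulated statistics at least as extreme as |T| = 2.
share_extreme = sum(abs(t) >= 2.0 for t in null_t) / len(null_t)
print(f"mean T: {statistics.mean(null_t):.3f}, share with |T| >= 2: {share_extreme:.3f}")
```

A histogram of `null_t` is a finite approximation of the sampling distribution under the null, and the share of values beyond a cut-off approximates the area of the corresponding slice.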

P-values
- Calculate the area of the null distribution that exceeds the estimated test statistic (5% » 0.05)
o Compute probability of observing the given test statistic (or one more extreme) if the null hypothesis is true.
o Compute probability of having sampled the data we observed (or more unusual data) from a population wherein there is no true mean difference in ratings.
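A minimal sketch of the tail-area calculation follows. It assumes the null distribution of the test statistic is standard normal (a large-sample approximation, where a t-distribution would normally be used for small samples) and that the test is two-sided. Neither assumption is stated in the notes, and `two_sided_p_value` is a hypothetical helper.

```python
from statistics import NormalDist

def two_sided_p_value(test_statistic):
    """Area in both tails of a standard normal null distribution beyond |T|."""
    null_distribution = NormalDist(mu=0.0, sigma=1.0)
    upper_tail = 1.0 - null_distribution.cdf(abs(test_statistic))
    return 2.0 * upper_tail

# Example: a statistic of 1.96 sits close to the conventional 5% cut-off.
p_value = two_sided_p_value(1.96)
print(round(p_value, 3))
```

The larger the absolute test statistic, the smaller the remaining tail area, and so the less compatible the data are with a population in which the null hypothesis is true.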

In R




Conclusions
- A careful evaluation of uncertainty is crucial to any responsible data analysis.
- Even in situations where you may be analysing the entire ‘population’, you’ll need statistical
inference to make reliable projections of future outcomes.
- For simple questions we can use statistical testing to control for uncertainty!



Statistical Modelling
- Statistical testing quickly reaches a limit
- In experiments, real-world ‘messiness’ is controlled through random assignment » knowledge generalisation
- Data scientists normally work with messy observational data instead of conducting experiments
- Model: mathematical representation of data distribution
- ^Y = ^B0 + ^B1*X

Data Model
- Different from an algorithmic model
- Modular model, built from probability distributions
- Encode our hypothesised understanding of the system we’re
exploring
- Constructed in a ‘top-down’ theory-driven way

Regression Problem
- Counterpart of classification problems (which involve a qualitative response)
- Has input (X) and output (Y), involves quantitative response
- Simple mean comparison » regression

Probability Distribution
- Unconditional (or marginal) distribution:
o Expected value of Y is the same for each observation
- Conditional distribution:
o Expected value of Y for each observation is defined by that observation’s characteristics
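The contrast between the two kinds of expected value can be illustrated with group means. The data and the grouping variable below are hypothetical and serve only as an example of an observation characteristic.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (group, outcome) observations, for illustration only.
observations = [
    ("control", 5.0), ("control", 6.0),
    ("treatment", 8.0), ("treatment", 9.0),
]

# Unconditional (marginal) expectation: one value shared by every observation.
unconditional_mean = mean(y for _, y in observations)

# Conditional expectation: depends on each observation's characteristic (its group).
outcomes_by_group = defaultdict(list)
for group, y in observations:
    outcomes_by_group[group].append(y)
conditional_means = {group: mean(ys) for group, ys in outcomes_by_group.items()}

print(unconditional_mean, conditional_means)
```

Regression generalises the second calculation: instead of a separate mean per group, the expected value of Y is written as a function of X.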

Simple Linear Regression
- The best fit line: ^Y = ^B0 + ^B1*X; observed data: Y = ^Y + e
o ^B0 » intercept » expected value of Y when X = 0
o ^B1 » slope » expected change in Y for a one-unit increase in X
o e » residual (estimation error) » (Y - ^Y)
- Regression coefficients
o Chosen to find the best fit line
o Most popular criterion: minimise the Residual Sum of Squares, RSS = Σ (Y - ^Y)²
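As a sketch of how the coefficients that minimise the residual sum of squares can be obtained, the example below uses the standard closed-form least-squares solution for one predictor. The data and function names are hypothetical. In practice a statistics package would do this fitting.

```python
def fit_simple_regression(x, y):
    """Return (intercept, slope) minimising the residual sum of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of X and Y divided by the variation in X.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted Y."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data (illustrative only).
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
intercept, slope = fit_simple_regression(x, y)
rss = residual_sum_of_squares(x, y, intercept, slope)
print(round(intercept, 2), round(slope, 2), round(rss, 3))
```

Any other line through the same data, for example one with a slightly different slope, yields a larger RSS, which is the sense in which this line is the "best fit".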
