Statistics and Methodology Summary

Extensive summary for the Master's course in Statistics and Methodology (with R code snippets). It contains all the information from the lectures, plus clear examples, additional information, and the important formulas.

Summary document, 24 pages, published October 18, 2019 (academic year 2019/2020).


Available practice questions


Some examples from this set of practice questions

1.

True or False? Normality of the predictors is an assumption of MLR

Answer: False

2.

What would happen if you were to attempt a statistical analysis without making any assumptions?

Answer: Your analysis would produce an infinite number of equally plausible results.

3.

True or False? The residual errors need to be statistically independent after accounting for the fitted regression model. However, the observations do not need to be independent.

Answer: True

4.

Why is analyzing uncertainty an important part of statistics?

Answer: Because statisticians must be able to objectively quantify confidence in their conclusions.

5.

Why are the predictor variables usually assumed to be fixed in multiple linear regression?

Answer: Because multiple linear regression is a model of the conditional mean response, and explicitly modeling the predictors’ distribution does not help model the mean response. Moreover, explicitly modeling the predictors’ distribution would substantially increase the computational complexity of regression modeling.

6.

True or False? When using multiple linear regression, the errors must be uncorrelated with the predictors.

Answer: True

7.

Why do we need to make assumptions when doing statistical analyses?

Answer: Data, by themselves, do not offer enough information to support statistical inference. We need to assume some properties of the population model that generated the data.

8.

What does the Gauss-Markov Theorem state?

Answer: The Gauss-Markov theorem states that if your linear regression model satisfies the first six classical assumptions, then ordinary least squares (OLS) regression produces unbiased estimates that have the smallest variance of all possible linear estimators.

9.

Why do we need the assumption of constant error variance in multiple linear regression?

Answer: Because multiple linear regression only models the mean response; we do not explicitly model the error variance.

10.

What is the most important consequence of violating the constant, finite error variance assumption?

Answer: The standard errors will be biased.

MSc. Data Science and Society Tilburg University 2019-2020


Lecture 1: Statistical Inference, Modeling, & Prediction
Statistical Reasoning: The purpose of statistics is to systematize the way that we account for
uncertainty when making data-based decisions.

Probability distributions: They quantify how likely it is to observe each possible value of some
probabilistic entity. Probability distributions are basically re-scaled frequency distributions. With an
infinite number of bins, a histogram smooths into a continuous curve. In a loose sense, each point on
the curve gives the probability of observing the corresponding X value in any given sample. The area
under the curve must integrate to 1.0.

Statistical Testing: We often want to distil this information into a single, simple
statistic so we can make judgments. When we conduct statistical tests, we weight the
estimated effect by the precision of the estimate. For example: the Wald test.
- A test statistic, by itself, is just an arbitrary number.
- Thus, we need to compare the test statistic to some objective reference that tells us
how exceptional our test statistic is. This reference is known as a sampling
distribution. We compare the estimated value to a sampling distribution of t-statistics
assuming no effect (thus, the distribution quantifies the null hypothesis).
o The special case of a null hypothesis of no effect is called the nil-null.
- If our estimated statistic would be very unusual in a population where the null
hypothesis is true, we reject the null and claim a statistically significant effect.
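
The precision-weighting idea behind the Wald test can be sketched in a few lines. This is a minimal Python illustration (the summary's own snippets are in R); the estimate and standard error below are hypothetical:

```python
# Minimal sketch of a Wald-style test statistic: the estimated effect
# divided by (i.e., weighted by) the precision of the estimate.
# The numbers are hypothetical, not from the lecture notes.

def wald_statistic(estimate, std_error, null_value=0.0):
    """Return the estimate, relative to the null value, in standard-error units."""
    return (estimate - null_value) / std_error

# A hypothetical slope estimate of 0.5 with a standard error of 0.2:
w = wald_statistic(0.5, 0.2)
print(round(w, 2))  # prints 2.5
```

By itself, w = 2.5 is just a number; it only becomes informative once compared to a sampling distribution.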

Sampling Distribution: The sampling distribution quantifies the possible values of the test statistic
over infinite repeated sampling (the "population" here is defined by an infinite sequence of repeated tests).
A sampling distribution is a slightly different concept than the distribution of a random variable.
- The sampling distribution quantifies the possible values of a statistic (e.g., F-statistic,
t-statistic, correlation coefficient, mean, etc.).
- The distribution of a random variable quantifies the possible values of a variable (e.g., sex,
age, attitude, salary, music preferences, etc.).
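
The distinction can be made concrete with a small simulation. A Python sketch (all parameters are arbitrary): draw many samples, compute the mean of each, and the resulting collection approximates the sampling distribution of the mean:

```python
# Illustrative simulation, not from the lecture notes: the sampling
# distribution of the sample mean, built by repeated sampling.
import random

random.seed(1)

def sample_means(n_samples, n, mu=0.0, sigma=1.0):
    """Draw n_samples samples of size n and return the mean of each."""
    return [
        sum(random.gauss(mu, sigma) for _ in range(n)) / n
        for _ in range(n_samples)
    ]

means = sample_means(n_samples=2000, n=25)
# Individual draws spread with sigma = 1, but the means cluster far more
# tightly: their standard deviation is roughly sigma / sqrt(n) = 0.2.
print(min(means), max(means))
```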

P-value: We can compute the probability of having sampled the data we observed, or more unusual
data, from a population wherein there is no true mean difference in ratings (by calculating the area
in the null distribution that exceeds our estimated test statistic).

Example: if t = 1.86 (test statistic) and the p-value is 0.032, all that we can say is that
there is a 0.032 probability of observing a test statistic at least as large as the observed
t, if the null hypothesis is true (proof by contradiction). We cannot say that there is a
0.032 probability of observing exactly t, if the null hypothesis is true, because the
probability of observing any individual point on a continuous distribution is exactly zero.
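
As a rough Python illustration (using the standard normal as a large-sample stand-in for the t distribution, so the number differs slightly from the 0.032 in the example), the one-tailed area beyond 1.86 is:

```python
# Sketch: one-tailed p-value as the tail area beyond the test statistic.
# The standard normal is used as a large-sample approximation to the
# t distribution; this is illustrative, not from the lecture notes.
import math

def normal_sf(z):
    """P(Z >= z) for a standard normal variable (survival function)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p = normal_sf(1.86)
print(round(p, 3))  # about 0.031
```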

One-tailed versus two-tailed: We only use a one-tailed test when we have directional hypotheses.

Statistical testing versus statistical modelling: Statistical testing is a very useful tool, but it quickly
reaches a limit: it works best in experimental contexts, where real-world “messiness” is controlled.
Data scientists, however, are rarely able to conduct experiments and must instead deal with messy
observational data. That is why data scientists need statistical modeling.






Statistical Modeling:
- Modelers attempt to build a mathematical representation of the (interesting aspects of) a
data distribution.
- Modelling the distribution = estimating β̂0 and β̂1
o Explaining the variation in the distribution by fitting a model to a sample.
- After we estimate β̂0 and β̂1, we can plug in new predictor data and get a predicted
outcome value for new cases.

Inference versus Prediction:
- When doing statistical inference, we focus on how certain variables relate to the outcome
(Example: Do men have higher job-satisfaction than women?)
- When doing prediction, we want to build a tool that can accurately guess future values.
(Example: Will increasing the number of contact hours improve grades?)

Lecture 2: Simple Linear Regression
Regression problem:
- Regression problems involve modeling a quantitative response.
- The regression problem begins with a random outcome variable, Y
- We hypothesize that the mean of Y is dependent on some set of fixed covariates, X.

Flavors of Probability Distribution:
- Marginal or unconditional: Each observation has the same expected value of Y,
regardless of its individual characteristics. There is a constant mean.
- Conditional: The value of Y that we expect for each observation is defined by the
observation’s individual characteristics. The distributions we consider in regression problems
have conditional means.

Projecting a Distribution onto the Plane: On the Y-axis, we plot our outcome variable. The X-axis
represents the predictor variable upon which we condition the mean of Y.

Modeling the X-Y Relationship in the Plane: We want to explain the relationship between Y and X by
finding the line that traverses the scatterplot as “closely” as possible to each point. This line is called
the “best fit line”. For any value of X, the corresponding point on the best fit line is the model’s best
guess for the value of Y.

Best fit line equation: Y = β0 + β1X + ε. We still need to account for the estimation
error, which is why the ε term appears:
- The ε term represents a vector of errors: the differences between Y and the true
regression line, β0 + β1X.
- The errors, ε, are unknown parameters, so we must estimate them.





Regression models: In the estimated regression model, Y = β̂0 + β̂1X + ε̂, the ε̂ term represents a
vector of residuals: the differences between Y and the estimated best fit line, β̂0 + β̂1X. The
residuals, ε̂, are sample estimates of the errors, ε.




E(Y|X) = β0 + β1X → the left side is the expected mean of Y within the population.

Estimating the Regression Coefficients: The purpose of regression analysis is to use a sample of N
observed {Yn, Xn} pairs to find the best fit line defined by β̂0 and β̂1.
- The most popular method to do this involves minimizing the sum of the squared residuals
(i.e., the estimated errors).

Residuals as the Basis of Estimation: The ε̂n are defined in terms of deviations between each
observed Yn value and the corresponding fitted value Ŷn. Each ε̂n is squared before summing
to remove negative values and produce a quadratic objective function:

RSS = Σn ε̂n² = Σn (Yn − Ŷn)²

The ordinary least squares (OLS) estimates of β1 and β0: The RSS is a very well-behaved
objective function that admits closed-form solutions for the minimizing values of β̂0 and
β̂1. In the equations, the betas (βs) are the parameters that OLS estimates; epsilon (ε)
is the random error.

β̂1 = Σn (Xn − X̄)(Yn − Ȳ) / Σn (Xn − X̄)²
β̂0 = Ȳ − β̂1X̄

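Because the solutions are closed-form, the fit is easy to sketch by hand. A minimal pure-Python version (the toy data are hypothetical; the summary's own snippets are in R):

```python
# Sketch of the OLS closed-form solutions for the intercept and slope.
# The toy data are hypothetical, not from the lecture notes.

def ols_fit(x, y):
    """Return (b0, b1) minimizing the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.1, 8.0, 9.9]   # roughly y = 2x
b0, b1 = ols_fit(x, y)
print(round(b0, 2), round(b1, 2))  # roughly 0.09 1.97
```
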
Mean centering: to improve interpretation
- The intercept is defined as the expected value of Y when X = 0. We can use mean centering
so that X = 0 is a meaningful point.
- We mean-center X by subtracting the mean from each Xn.
- Now, suppose the estimated intercept is 143.83. This means that, for the average X-value,
the expected value of Y is 143.83.
- Centering only translates the scale of the X-axis and does not change the linear relationship.
Thus, the slope won’t change, only the intercept.

Thinking about Inference: We need to use statistical inference to account for the precision
with which we’ve estimated β̂0 and β̂1. We cannot be sure that the linear relationship will
be the same if we examine a new sample.
- Our regression coefficients both have (normally distributed) sampling distributions that
we can use to judge the precision of our estimates.






Standard Errors: The standard deviations of the preceding sampling distributions quantify the
precision of our estimated β̂0 and β̂1.
- The sampling distributions are theoretical entities, because the standard error is itself
only an estimate.
- A large SE is bad (imprecise); a small SE is good (quite precise).

Interpreting Confidence Intervals: Say we estimate a regression slope of β̂1 = 0.5 with an
associated 95% confidence interval of CI = [0.25; 0.75]. We don’t talk about 95%
probabilities when interpreting CIs → instead, we talk about 95% confidence.
- The true value of β1 is fixed. β1 is either in our estimated interval or not. Thus, the
probability that β1 is within our estimated interval is either exactly 1 or exactly 0.
- If we collected a new sample of the same size, re-estimated our model, and re-computed
the 95% CI for β̂1, we would get a different interval. Repeating this process an infinite
number of times results in a distribution of CIs. 95% of those CIs would surround the true
value of β1.
- Thus: we are 95% certain that if we repeat the analysis an infinite number of times, 95%
of the CIs that we’ll find will surround the true value of β1. → This suggests that we can
be 95% confident that the true value of β1 is somewhere between 0.25 and 0.75.
- CIs give us a plausible range for the population value of β → CIs support inferences.
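
The repeated-sampling interpretation can be checked by simulation. A hedged Python sketch (all parameters are made up, and sigma is treated as known to keep the interval simple):

```python
# Sketch of the repeated-sampling interpretation of a 95% CI: simulate many
# samples, build a CI for the mean each time, and count how often the CI
# contains the true value. Illustrative only; not from the lecture notes.
import random

random.seed(7)
TRUE_MEAN, SIGMA, N, REPS = 10.0, 2.0, 50, 2000

covered = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    se = SIGMA / N ** 0.5          # known sigma keeps the sketch simple
    if mean - 1.96 * se <= TRUE_MEAN <= mean + 1.96 * se:
        covered += 1

coverage = covered / REPS
print(round(coverage, 2))  # close to 0.95
```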

Model Fit for Inference: How well does our model describe/represent the real world? It will
never be perfect. Our model explains some proportion of the outcome’s variability.
- The residual variance will be less than Var(Y).
- We reduce the residuals, by adding new variables to the model, until only meaningless
noise remains.
- We quantify the proportion of the outcome’s variance that is explained by our model using
the R² statistic:

R² = 1 − RSS / TSS

- TSS = total sum of squares: Σn (Yn − Ȳ)²
- RSS = residual sum of squares: Σn (Yn − Ŷn)²
- If R² is 0.62, it means that our predictor(s) explain 62% of the variability in the
outcome.
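
A minimal Python sketch of the R² computation (toy values, not from the notes):

```python
# Sketch: R-squared as the proportion of outcome variance explained,
# computed as 1 - RSS / TSS. The toy predictions are hypothetical.

def r_squared(y, y_hat):
    """Proportion of Var(Y) explained by the fitted values."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1.0 - rss / tss

y_obs = [3.0, 5.0, 7.0, 9.0]
y_fit = [3.5, 4.5, 7.5, 8.5]
print(round(r_squared(y_obs, y_fit), 2))  # 0.95
```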

Model Fit for Prediction: When assessing predictive performance, we will most often use the
mean squared error (MSE) as our criterion.
- The MSE quantifies the average squared prediction error. Taking the square root (giving
the RMSE) improves interpretation: the RMSE estimates the magnitude of the expected
prediction error.
- RMSE = 32.06 → we expect prediction errors with magnitudes of about 32.06 units of Y.
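
A minimal Python sketch of the RMSE computation (the observed and predicted values are hypothetical):

```python
# Sketch: MSE and RMSE for a set of predictions. The RMSE puts the expected
# prediction error back on the scale of Y. Toy numbers are hypothetical.
import math

def rmse(y, y_hat):
    """Root mean squared prediction error."""
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
    return math.sqrt(mse)

y_obs = [100.0, 150.0, 200.0]
y_pred = [90.0, 160.0, 230.0]
print(round(rmse(y_obs, y_pred), 2))  # about 19.15
```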




