Study guide for Intermediate Statistics I at Erasmus University College.
It is designed especially for EUC students and covers the chapters named in the description, but you can of course still use it for your own course. Good luck with your exams!
STUDY GUIDE INTERMEDIATE STATISTICS I, ERASMUS UNIVERSITY COLLEGE 2020, PBL 1
Chapter 2: Everything you never wanted to know about statistics
2.2 Building statistical models
Scientists build (statistical) models of real-world processes in an attempt to predict how these processes operate under certain conditions. A model must represent the data collected (the observed data).
Fit: The degree to which a statistical model represents the data collected is known as the fit of the model.
Good fit: The model closely resembles reality, so predictions based on it can confidently be expected to be accurate.
Moderate fit: There are some similarities to reality, but also some important differences.
Poor fit: Any predictions based on this model are likely to be completely inaccurate.
2.3 Populations and samples
Population: The complete set of observations a researcher is interested in.
Sample: A subset of a population, often taken for the purpose of statistical inference.
Linear model: A model based upon a straight line.
Linear models are associated with two types of bias:
1. Many models in the scientific literature might not be the ones that fit best.
2. Many data sets might not have been published because a linear model was a poor fit.
Both biases exist because researchers did not consider a non-linear model.
Scatter plot: A scatter plot of two variables shows the values of one variable on the Y-axis and the values of the other variable on the X-axis. Scatter plots are well suited for revealing the relationship between two variables.
Positive association: There is a positive association between variables X and Y if smaller values of X are associated with smaller values of Y and larger values of X are associated with larger values of Y.
Negative association: There is a negative association between variables X and Y if smaller values of X are associated with larger values of Y and larger values of X are associated with smaller values of Y. A perfect negative linear relationship has r = -1.
Linear relationship: There is a perfect linear relationship between two variables if the points in a scatter plot fall on a straight line; a perfect positive linear relationship has r = 1. The relationship is still linear if the points diverge from the line, as long as the divergence is random rather than systematic.
Correlation: The correlation measures the direction and strength of the linear relationship between two quantitative variables (interval and ratio). It describes the association between X and Y. Correlation is usually written as r. It ranges from -1 to 1, it is symmetric (the correlation of X with Y is the same as that of Y with X), and it is unaffected by linear transformations.
Third variable problem: A third variable is responsible for the correlation between two other variables.
Covariance: The covariance between variables X and Y is an unstandardized measure of the linear association between them.
(Pearson) correlation: The correlation measures the direction and strength of a linear relation. It is the standardized version of the covariance: its value does not depend on the measurement scale of the variables. Values near -1 or +1 indicate a strong negative or positive relation, respectively; values near 0 indicate a weak relation.
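For example, covariance and Pearson correlation can be computed by hand in Python; this is a minimal sketch with made-up data, not an example from the textbook:

# Covariance and Pearson correlation for two made-up variables.
import statistics

x = [1, 2, 3, 4, 5]   # hypothetical predictor values
y = [2, 4, 5, 4, 6]   # hypothetical outcome values

n = len(x)
mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Covariance: average cross-product of deviations from the means (n - 1 denominator).
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Pearson r: the covariance divided by the product of the standard deviations,
# which removes the dependence on the measurement scale.
r = cov_xy / (statistics.stdev(x) * statistics.stdev(y))

print(cov_xy, r)   # r always lies between -1 and 1

Because r is standardized, rescaling x (for example from meters to centimeters) changes the covariance but leaves r unchanged.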
2.4 Statistical models
Outcome = (model) + error
Statistical models are made up of variables and parameters.
Parameter: Whereas variables measure data, parameters describe the relation between those variables. They are constants that represent some truth about the measured variables (e.g., a regression coefficient). A parameter is a value calculated in a population.
Statistic: A value computed in a sample to estimate a parameter.
2.4.1 The mean as a statistical model
Three measures for the center of a distribution: mean, median and mode.
Mean: The mean is the sum of the observations divided by the number of observations. It lies at the center of the distribution on the x-axis.
Median: The median is the value that splits the numerically ordered observations into two equal parts; it is the middle value of all observations. If the data cannot be split into two equal parts (which is the case when there is an even number of observations), the median is computed by taking the mean of the two middle observations. The median is denoted by M.
Mode: The mode is the most frequent value. If multiple values have the same frequency, the data have multiple modes.
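As a quick sketch (made-up scores), the three measures of the center can be computed with Python's statistics module:

# Mean, median and mode for a small made-up data set.
import statistics

scores = [2, 3, 3, 5, 7, 8]

print(statistics.mean(scores))       # (2 + 3 + 3 + 5 + 7 + 8) / 6 ≈ 4.67
print(statistics.median(scores))     # even number of observations: mean of 3 and 5 = 4.0
print(statistics.multimode(scores))  # [3], the most frequent value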
Measures of the variability of a distribution (measures of fit): Variance, standard deviation, percentiles, quartiles, interquartile range (IQR).
Variance / mean squared error: The average of the squared deviations from the mean; also called the mean squared error.
Standard deviation: The square root of the variance. It is an indicator of how much the scores vary around the mean.
Degrees of freedom (df): The degrees of freedom of an estimate is the number of independent pieces of information on which the estimate is based. Dependent pieces of information do not count as degrees of freedom. In general, the degrees of freedom for an estimate equal the number of values minus the number of parameters estimated en route to the estimate in question. The denominator of the variance is therefore the degrees of freedom: (n - 1).
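A minimal sketch (made-up scores) of the variance and standard deviation computed by hand, with the degrees of freedom (n - 1) in the denominator as described above:

# Variance and standard deviation with df = n - 1.
scores = [2, 3, 5, 7, 8]

n = len(scores)
mean = sum(scores) / n

# Sum of squared deviations from the mean (the sum of squared errors).
ss = sum((x - mean) ** 2 for x in scores)

variance = ss / (n - 1)    # mean squared error, divided by the degrees of freedom
sd = variance ** 0.5       # standard deviation: square root of the variance

print(mean, variance, sd)  # 5.0, 6.5, ~2.55 (matches statistics.variance and statistics.stdev)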
2.4.3 Estimating parameters
The parameter estimate you find has the least error given the data you have.
Method of least squares: A method of estimating parameters (such as the mean, or a regression coefficient) that is based on minimizing the sum of squared errors. The parameter estimate will be the value, out of all possible values, that has the smallest sum of squared errors.
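To illustrate the idea (hypothetical data, not the textbook's example): among many candidate values for the center of the data, the one with the smallest sum of squared errors turns out to be the sample mean.

# The mean as the least-squares estimate of the center of the data.
scores = [2, 3, 5, 7, 8]   # made-up observations

def sum_of_squared_errors(candidate, data):
    return sum((x - candidate) ** 2 for x in data)

# Candidate values from 2.00 to 8.00 in steps of 0.01 (the range of these scores).
candidates = [min(scores) + i * 0.01 for i in range(601)]
best = min(candidates, key=lambda c: sum_of_squared_errors(c, scores))

print(best)                       # ~5.0: the candidate with the smallest sum of squared errors
print(sum(scores) / len(scores))  # 5.0: the sample mean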
The standard normal distribution: A normal distribution with mean 0 and standard deviation 1: N(0, 1).
2.5.1 The standard error
Sampling variation: The extent to which a statistic (the mean, median, t, F, etc.) varies in samples taken from the same population.
Sampling distribution / probability distribution: The distribution of possible values of a given statistic that we could expect to get from a given population.
The standard deviation of the sampling distribution of the sample mean: If we have a population with mean μ and standard deviation σ and we repeatedly draw random samples with n observations from this population, then the standard deviation of the sampling distribution of x̄ is given by σ/√n. The standard deviation of the sampling distribution of the sample mean is also referred to as the standard error (of the mean), SE. It is a measure of how representative a sample is likely to be of the population.
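A minimal sketch (made-up sample): in practice σ is usually unknown, so the standard error is estimated as s/√n from the sample itself.

# Standard error of the mean, estimated from a made-up sample.
import math
import statistics

sample = [4, 6, 5, 7, 3, 6, 5, 4]

n = len(sample)
s = statistics.stdev(sample)   # sample standard deviation (n - 1 in the denominator)
se = s / math.sqrt(n)          # estimated standard error of the mean

print(se)   # a small SE suggests the sample mean is likely to be representative of the population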
Central limit theorem (CLT): For any population with finite mean μ and finite non-zero variance σ², the sampling distribution of the sample mean approaches a normal distribution with mean μ and standard deviation σ/√n. When n is "large enough", the sampling distribution of the sample mean is approximately N(μ, σ/√n), even when the population itself is not normally distributed. The theorem only holds when (1) n is large enough (> 30) and (2) the observations are independent.
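A simulation sketch of the CLT (the exponential population below is an assumption chosen because it is clearly non-normal): means of repeated samples of size n = 50 are approximately normally distributed around μ with spread σ/√n.

# Simulating the central limit theorem with a skewed (exponential) population.
import random
import statistics

random.seed(1)
n = 50                         # "large enough" sample size
sample_means = []
for _ in range(2000):          # draw many independent random samples
    sample = [random.expovariate(1.0) for _ in range(n)]  # population has mu = 1, sigma = 1
    sample_means.append(sum(sample) / n)

print(statistics.mean(sample_means))   # close to mu = 1
print(statistics.stdev(sample_means))  # close to sigma / sqrt(n) = 1 / 50 ** 0.5 ≈ 0.14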
2.5.2.1 Calculating confidence intervals
Confidence interval: An interval of reasonable values for the population mean; a range of scores likely to contain the parameter being estimated. Intervals can be constructed to be more or less likely to contain the parameter: 95% of 95% confidence intervals contain the estimated parameter, whereas 99% of 99% confidence intervals contain the estimated parameter. The wider the confidence interval, the more uncertainty there is about the value of the parameter. A confidence interval has a confidence level C, where C is the probability that the interval will capture the true parameter value in repeated samples. For a large sample, the interval is x̄ ± z* × σ/√n, where z* is the critical value that cuts off an area of (1 - C)/2 in each tail of the standard normal distribution.
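A hedged sketch with made-up numbers: a 95% confidence interval for the mean when σ is treated as known, using z* = 1.96 because (1 - 0.95)/2 = 0.025 of the standard normal distribution lies in each tail.

# 95% confidence interval for the mean with a known population standard deviation.
import math

x_bar = 52.0   # sample mean (hypothetical)
sigma = 8.0    # population standard deviation, assumed known (hypothetical)
n = 64         # sample size
z_star = 1.96  # cuts off 0.025 in each tail of N(0, 1)

margin = z_star * sigma / math.sqrt(n)
print(x_bar - margin, x_bar + margin)   # 95% CI: (50.04, 53.96)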
2.5.2.3 Calculating confidence intervals in small samples
t-distribution: Used to counteract the bias that arises when the population standard deviation is estimated from a small sample. t-distributions are bell-shaped and symmetric about 0, but their precise form depends on the degrees of freedom. We use the notation t(k) for a t-distribution with k degrees of freedom. t-distributions have more probability in the tails, but as the degrees of freedom increase, the t-distribution approaches the standard normal distribution.
The confidence interval in small samples is: CI = x̄ ± t(n-1) × s/√n
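A hedged sketch of this small-sample interval (made-up data; it assumes SciPy is available for the t critical value):

# 95% confidence interval in a small sample, using t(n - 1) instead of z.
import math
import statistics
from scipy import stats   # assumed available

sample = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3]
n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

t_star = stats.t.ppf(0.975, df=n - 1)   # critical value for a 95% interval with n - 1 df
margin = t_star * s / math.sqrt(n)
print(x_bar - margin, x_bar + margin)

Because the t critical value is larger than 1.96 for small samples, this interval is wider than the z-based interval, reflecting the extra uncertainty from estimating σ with s.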
2.5.2.4 Showing confidence intervals visually
By showing confidence intervals graphically, we can see whether they overlap or not (and thus whether the means could come from the same population).
If they do not overlap, there are two possible reasons:
1. Both confidence intervals contain their population mean, but the samples come from different populations.
2. Both samples come from the same population, but one of the confidence intervals does not contain the population mean.
2.6.1 Null hypothesis significance testing
Significance test: There are two approaches.
1. FISHER: A significance test is conducted and the probability value reflects the strength of the evidence against the null hypothesis.
   p < 0.01: the data provide strong evidence that the null hypothesis is false.
   0.01 < p < 0.05: the null hypothesis is rejected, but with less confidence.
   0.05 < p < 0.10: weak evidence; the null hypothesis cannot be rejected. Higher probabilities provide less evidence that the null hypothesis is false.
   This approach is more suitable for scientific research.
2. NEYMAN AND PEARSON: Specify an α level before analyzing the data. If the data analysis results in a probability value below the α level, the null hypothesis is rejected; if not, the null hypothesis is not rejected. How far below the α level the p-value falls does not matter.
   This approach is more suitable for yes/no decisions.
Alternative hypothesis: The prediction that there will be an effect.
Null hypothesis: States that your prediction is wrong and the predicted effect does not exist.
Probability value (p): The probability of an outcome given that the null hypothesis is true. It is not the probability of the hypothesis given the outcome.
Significance level: The probability value below which the null hypothesis is rejected is called the significance level (α). If the null hypothesis is rejected, it only means that the effect is not exactly zero; it does not tell you whether the effect is important or large. Finding that an effect is statistically significant indicates that the observed effect is unlikely to be due to chance alone.
Hypotheses can be directional or non-directional.
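A hedged sketch tying the two approaches together (made-up scores; it assumes SciPy is available): a one-sample t-test of the null hypothesis μ = 50 gives a p-value that Fisher would read as strength of evidence and Neyman-Pearson would compare with a pre-specified α.

# One-sample t-test of H0: mu = 50, read in both the Fisher and Neyman-Pearson way.
from scipy import stats   # assumed available

sample = [52, 55, 49, 53, 56, 51, 54, 50]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(t_stat, p_value)

# Neyman-Pearson: fix alpha before looking at the data, then make a yes/no decision.
alpha = 0.05
print("reject H0" if p_value < alpha else "do not reject H0")

# Fisher: read the p-value itself as the strength of the evidence against H0
# (e.g., p < 0.01 strong evidence, 0.01 < p < 0.05 weaker evidence).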
2.6.1.4 Test statistic