'Advanced Statistics' is the sequel to 'Introduction Statistics' taught in the first year of the Sociology Bachelor at the University of Amsterdam. The course was taught by Chip Huisman in the academic year 2019-2020. Advanced Statistics focuses on multiple regression techniques, building on previous intr...
Lectures by Chip Huisman
Semester 1, Block 3 2019-2020
Lecture 1 – 06/01/2020
Relationship between 2 variables
We call the analysis of the relationship between 2 variables ‘bivariate analysis’.
Association = Correlation = Relation
- Dependent and independent variable
- Response and explanatory variable
- Outcome and predictor variable
- y and x variable
We only look at interval/ratio variables.
The relationship between variables can be studied and analyzed by generating and looking
at a scatter plot.
Step-by-step plan for drawing a scatter diagram/scatter plot:
1. Draw the axes and determine which variable goes on which axis
2. Determine the range of the values and mark them on the axes
3. Place a dot for each pair of scores
4. (If necessary, give the dots a name)
The correlation coefficient (Pearson r)
- Displays the linear relationship between 2 interval/ratio variables
- A positive number indicates a positive relation, a negative number a negative relation
- The value lies between -1 (perfect negative correlation) and +1 (perfect positive
correlation). 0 means no correlation at all
- Correlation does not depend on original units of measurement
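As a minimal sketch, the coefficient can be computed by hand in Python. The function name `pearson_r` and the small height/weight data set are made up for illustration; they are not from the lecture. The second call expresses the heights in metres instead of centimetres to show that r does not depend on the units of measurement.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviation scores
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: height in cm and weight in kg
height_cm = [160, 165, 170, 175, 180]
weight_kg = [55, 60, 63, 70, 72]

r_cm = pearson_r(height_cm, weight_kg)
# Same heights expressed in metres: r stays the same
r_m = pearson_r([h / 100 for h in height_cm], weight_kg)
print(round(r_cm, 3), round(r_m, 3))
```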
Linear relationships
Linear function: y = α + βx
This formula expresses the values on the y-axis as a linear function of the values on the x-axis. The formula describes a straight line with slope β (beta) and y-intercept α (alpha).
The slope β (beta) = a number that indicates how much the value of y increases or
decreases when x increases by one unit.
The y-intercept α = a number that indicates where the line crosses the y-axis. This is also
called the constant.
Linear means rectilinear/straight.
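A small sketch of such a linear function, with made-up values α = 2 and β = 0.5 (assumptions for illustration only). It shows that the intercept is the value of y at x = 0 and that y changes by β each time x goes up by one.

```python
def linear(x, alpha=2.0, beta=0.5):
    """Return y = alpha + beta * x for hypothetical alpha and beta."""
    return alpha + beta * x

# The y-intercept (constant) is the value of y where x = 0
intercept = linear(0)

# Increasing x by one unit changes y by exactly the slope beta
step = linear(4) - linear(3)

print(intercept, step)
```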
Intermezzo
Nominal + order = ordinal
Ordinal + differences equally large = interval
Interval + zero point = ratio
What is a MODEL?
A model is an approximation to reality.
A statistical model is an approximation of a characteristic of individuals within a population.
Everyone within a population has an age. But for a very large population it is impractical
to list every individual age. So you give an approximation by calculating the average/mean age.
Ergo, the average/mean is a statistical model.
Similarly, a relationship between two variables within a population can be expressed with a
model.
This relationship between two variables can be represented by a linear function.
Taken together, this is called a linear model.
Least squares prediction equation
Prediction refers to the formal/mathematical aspect of a model. You put data in your model
and your model predicts an outcome.
Estimation refers to the statistical application of a model. You apply a model to sample data
in order to say something about a population. Based on sample data you can estimate a
linear model.
What we try to estimate is the line (a linear model) that best fits the data. The least squares
method (OLS = Ordinary Least Squares) appears to be the most suitable for this.
Prediction and estimation are used interchangeably by many people but there is a
difference.
Estimating a line based on a cloud of observed data points
We want to find the line (linear model) that best summarizes the data.
How do we do that?
We need a prediction equation: ŷ = a + bx
ŷ (y-hat) is the predicted value of y given the value of x.
Where we have to calculate the a and the b with:
$$b = \frac{s_{xy}}{s_x^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$a = \bar{y} - b\bar{x}$$
Intermezzo
Lowercase Greek letters are used for population parameters.
Roman letters are used for sample statistics.
μ (Greek mu) and σ (Greek lowercase sigma) indicate the mean and the standard deviation
of a population (these are often unknown).
ȳ and s indicate the mean and standard deviation of a sample. These are therefore variables
whose values depend on the sample drawn.
μ and σ are constants because they relate to observations of the entire population.
ȳ and s are often used to estimate the often unknown μ and σ.
ŷ (y-hat) is the predicted value of y given the value of x within a prediction equation.
Formula for the b-coefficient or slope
$$b = \frac{s_{xy}}{s_x^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
If we divide the covariance by the variance we get the b-coefficient or slope.
Deviation score x = $(x_i - \bar{x})$
Deviation score y = $(y_i - \bar{y})$
Σ (Greek capital sigma) means that you have to add things up.
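Why the two expressions for b are equal: the sample covariance and the sample variance are both sums of deviation scores divided by n − 1 (assuming the usual sample definitions with n − 1 in the denominator, which these notes do not spell out), so that factor cancels.

```latex
s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1},
\qquad
s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}

b = \frac{s_{xy}}{s_x^2}
  = \frac{\sum (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)}{\sum (x_i - \bar{x})^2 / (n - 1)}
  = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
```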
Step-by-step plan for calculating the b-coefficient:
1. Calculate the means for x and y
2. Calculate all the individual deviations (deviation scores) for x and y
3. Calculate all the individual squared deviations for x
4. Calculate all the individual products of the deviation scores of x and y
5. Calculate the sum of the deviation scores of x squared
6. Calculate the sum of the deviation scores of x times the deviation scores of y
7. Divide the sum of the deviation scores of x times the deviation scores of y by the sum
of the deviation scores of x squared
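The steps can be sketched in Python as follows. The data set and the variable names are hypothetical, chosen so that the arithmetic is easy to follow by hand. The intercept a (mean of y minus b times mean of x) is added at the end to complete the prediction equation.

```python
# Hypothetical data: x = hours studied, y = exam grade
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# 1. Means of x and y
mean_x = sum(x) / n
mean_y = sum(y) / n

# 2. Deviation scores for x and y
dev_x = [xi - mean_x for xi in x]
dev_y = [yi - mean_y for yi in y]

# 3. Squared deviation scores of x
dev_x_sq = [dx ** 2 for dx in dev_x]

# 4. Products of the deviation scores of x and y
dev_xy = [dx * dy for dx, dy in zip(dev_x, dev_y)]

# 5. Sum of the squared deviation scores of x
sum_x_sq = sum(dev_x_sq)

# 6. Sum of the products of the deviation scores
sum_xy = sum(dev_xy)

# 7. Slope: sum of products divided by sum of squares
b = sum_xy / sum_x_sq

# Intercept from a = mean_y - b * mean_x
a = mean_y - b * mean_x

print(f"b = {b:.2f}, a = {a:.2f}")
```

For this made-up data the sum of squares is 10 and the sum of products is 6, so the slope is 6 / 10.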
Beware of outliers
An outlier is an extreme value which can have a strong influence on the slope of the
regression line.
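A small sketch of this effect, using made-up data and a hypothetical helper `slope`. In the first data set all points lie exactly on a line with slope 1; in the second, only the last y-value is replaced by an extreme value.

```python
def slope(x, y):
    """Least squares slope: sum of cross-products over sum of squares of x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    den = sum((xi - mean_x) ** 2 for xi in x)
    return num / den

x = [1, 2, 3, 4, 5]
y_clean = [1, 2, 3, 4, 5]       # all points on the line y = x
y_outlier = [1, 2, 3, 4, 15]    # same data, but the last point is an outlier

b_clean = slope(x, y_clean)
b_outlier = slope(x, y_outlier)
print(b_clean, b_outlier)
```

A single extreme point is enough to triple the slope in this example.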
The prediction equation has the least squares property
Why is that useful/relevant?
You want the line ŷ = a + bx that gives the best fit for the observed cloud of data points.
Therefore you want the smallest sum of squared errors (SSE).
The SSE is a measure of the discrepancy between the line ŷ = a + bx and the cloud of
observed data points: the sum of the squared differences between the observed and predicted values of y.
Properties:
- The sum of the residuals is zero
- The line always goes through the center of the data: the point (x̄, ȳ)
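Both properties can be checked numerically. The sketch below uses a small made-up data set and a hypothetical helper `fit_line`; because of floating-point rounding, the checks use a small tolerance rather than exact equality.

```python
def fit_line(x, y):
    """Return intercept a and slope b of the least squares line."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data
x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 10, 9]

a, b = fit_line(x, y)
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Residual = observed y minus predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Property 1: the residuals sum to (approximately) zero
print(abs(sum(residuals)) < 1e-9)

# Property 2: the line passes through the point of means
print(abs((a + b * mean_x) - mean_y) < 1e-9)
```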
What does 'least squares' mean, and what does the sum of squared errors mean?
The line through a point cloud is a model for that point cloud. And you want that model to
represent that point cloud as well as possible.
(Figure: the real Titanic next to a model of the Titanic)
So, you go look for the best matching/fitting line to the point cloud.
But which line is that? It is the line where the distances between the predicted values for y
and the observed values for y are the smallest. Such a difference is called the prediction error
(residual).
Point cloud with regression line and residuals -> the most appropriate line is the line where
the sum of the squared residuals is the smallest.
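To illustrate, the sketch below computes the SSE for the least squares line and for two other candidate lines through the same made-up data. The helper `sse` and the alternative lines are assumptions for illustration; the least squares values a = 2.2 and b = 0.6 follow from applying the formulas for a and b to this data.

```python
def sse(x, y, a, b):
    """Sum of squared errors of the line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sse_ols = sse(x, y, a=2.2, b=0.6)      # least squares line for this data
sse_steeper = sse(x, y, a=1.0, b=1.0)  # an alternative, steeper line
sse_flat = sse(x, y, a=4.0, b=0.0)     # a horizontal line at the mean of y

print(round(sse_ols, 2), round(sse_steeper, 2), round(sse_flat, 2))
```

Both alternative lines give a larger SSE than the least squares line, which is what the least squares property states.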
The prediction equation has the least squares property
- The prediction errors are called residuals: