This summary covers chapters 2, 3, 6, and 8; chapter 9 will be posted together with chapter 11 in the Week 2 summary.

Applied Multivariate Data Analysis – Week 1


Ch 2: The Spine of Statistics


The acronym SPINE stands for:

(1) Standard Error
(2) Parameters
(3) Interval Estimates (confidence intervals)
(4) Null Hypothesis significance testing
(5) Estimation

Statistical Models
Scientists collect data from the real world to test predictions from hypotheses about a
phenomenon

- Testing these hypotheses involves building statistical models of the phenomenon
of interest

Scientists build statistical models of real-world processes to predict how these processes
operate under certain conditions

- Scientists do not have access to the real-world situation – and can only infer
things about processes based upon the models built
o The statistical model should represent the data collected – i.e., the
observed data – as closely as possible in order for the predictions to be
accurate

The degree to which the statistical model represents the data collected – called the fit of the
model

1. An excellent representation of the real-world situation => good fit
2. A model with some similarities – but also important differences – to real-world
situation => moderate fit
3. A model that is completely different from the real-world situation => poor fit

- If the model is a poor fit to the observed data – the predictions inferred from it will be equally poor

Types of Statistical Models

Linear Models  Models based on a straight line
 Statistical systems based on the linear model include ANOVA and
regression
 Linear models tend to get fitted to data – as they are less complex and non-
linear models are rarely taught

Non-Linear Models  Can be a good fit for some types of data/research
 Rarely taught – thus, rarely used



Data can be represented on a scatterplot – in which each dot represents a certain score

Consequences of Using Mainly Linear Models

1) Many published statistical models may not be the ones that fit best – because the authors did not try non-linear models
2) Findings may have been missed because a linear model was a poor fit – and scientists gave up rather than fitting non-linear models

It is best to plot the data first – if the plot seems to suggest a non-linear model, then do not apply a linear model

Statistical Models – Main Equation

Everything in statistics boils down to one equation:

outcome_i = (model) + error_i

This equation means that the data we observe can be predicted from the model we choose +
some amount of error

- Here, the subscript i refers to the ith score => reflecting the fact that the value of the outcome and the error will be different for each person

The ‘model’ in the equation will vary depending on:

(1) The design of the study

(2) The type of data
(3) The aim of using the model

We predict an outcome variable from some model – but we do so imperfectly – therefore,
there is some error in there
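As a sketch of this equation in code (Python, with made-up scores), the mean can serve as the simplest possible model – every observed score then decomposes exactly into model plus error:

```python
import math

# Made-up scores, purely for illustration
outcomes = [1, 3, 4, 3, 2]

# The simplest model: a single parameter b0, estimated here by the sample mean
b0 = sum(outcomes) / len(outcomes)  # 2.6

# Each observed score is the model's prediction plus that score's error:
# outcome_i = (model) + error_i
errors = [y - b0 for y in outcomes]
for y, e in zip(outcomes, errors):
    assert math.isclose(y, b0 + e)
```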

Populations and Samples
Scientists are interested in finding results that apply to an entire population of entities (=>
generalizable)

A population can be (1) very general – e.g., all human beings – or (2) very narrow – e.g., all
male ginger cats

Typically, scientists strive to infer things about general populations rather than narrow ones

- As such findings and conclusions have a much wider impact

There is rarely access to every member of a population – therefore, data is collected from a
smaller subset of the population – i.e., a sample

- The data is then used to infer things about the population as a whole

The bigger the sample => the more likely it is to reflect the whole population

- Different random samples will give slightly different results – but on average, the results from large samples will be similar

P is for Parameters
Parameters are the P in the SPINE of statistics

Statistical models are made up of variables and parameters

1) Variables – i.e., measured constructs that vary across entities in the sample
2) Parameters – i.e., they are not measured and are constants believed to represent
some fundamental truth about the relations between variables in the model
- E.g., the mean and median – which estimate the center of the distribution – and correlation and regression coefficients – which estimate the relationship between two variables
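For instance, Python's statistics module computes these center estimates directly (scores made up for illustration):

```python
import statistics

scores = [1, 3, 4, 3, 2]                   # illustrative sample

center_mean = statistics.mean(scores)      # 2.6 – one estimate of the center
center_median = statistics.median(scores)  # 3   – another estimate of the center
```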

Case (1) – In cases in which one is only summarizing the outcome (=> as we are when computing the mean) then, there will be no variables in the model – only a parameter:

outcome_i = (b_0) + error_i

Case (2) – In cases in which we want to predict an outcome from a variable => expand the model to include this variable (predictor variables are denoted with X):

outcome_i = (b_0 + b_1 X_i) + error_i

This equation predicts the value of the outcome for a particular entity (i) – not just from the value of the outcome when there are no predictors (b_0)

- But from the entity's score on the predictor variable (X_i)

The predictor variable has a parameter (b_1) attached to it

- This parameter tells us something about the relationship between the predictor X_i and the outcome

Case (3) – In cases when predicting an outcome from two predictors => add another
predictor to the model:

outcome_i = (b_0 + b_1 X_1i + b_2 X_2i) + error_i

This model predicts the value of the outcome for a particular entity i from the value of the outcome when there are no predictors (b_0) and the entity's score on two predictor variables (X_1i and X_2i)

Each predictor variable has a parameter (b_1, b_2) attached to it => tells us something about the relationship between that predictor and the outcome
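A sketch of this two-predictor model in Python, with hypothetical parameter values and scores (none taken from real data):

```python
# Hypothetical parameter values, not estimated from any data
b0, b1, b2 = 1.0, 0.5, -0.25

def predict(x1, x2):
    """The model part of: outcome_i = (b0 + b1*X_1i + b2*X_2i) + error_i."""
    return b0 + b1 * x1 + b2 * x2

# One entity's (made-up) predictor scores and observed outcome
observed = 2.0
model = predict(x1=4.0, x2=2.0)   # 1.0 + 2.0 - 0.5 = 2.5
error = observed - model          # -0.5: the part the model fails to explain
```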

In Summary – values of an outcome variable can be predicted based on a model

The form of model changes – but there will always be some error in prediction

- And there will always be parameters that tell us about the shape or form of the
model

Working Out How the Model Looks

In order to work out what the model looks like => estimate the parameters (i.e., the values of b)

- We want to know what our model may look like in the whole population =>
parameter estimates

The model is defined by parameters – as such, we are not interested in the parameter values
in the sample => interested in the parameter values in the population

The sample data can only be used to estimate the population parameter values – since we did
not measure the population, but only the sample



The Mean as a Statistical Model


The mean value is a hypothetical value – i.e., it is a model created to summarize the data and
there will be error in prediction

The model is:

outcome_i = (b_0) + error_i

In which the parameter b_0 => is the mean of the outcome

- The value of the mean/parameter computed in a sample – can be used to estimate
the value in the population

outcome_i = (b̂_0) + error_i

When referring to an estimate => add a hat (^) on top, to express explicitly that the value is an estimate and does not represent the true parameter value



Assessing the Fit of a Model – Sum of Squares and Variance


With most statistical models – can determine whether the model represents the data well by
looking at how different the scores observed in the data are from the values that the model
predicts

Estimating Model Fit for a Particular Entity

Given that the model predicts a mean of 2.6 and the observed outcome for entity 1 is 1 => in order to calculate the error, fill in and rearrange the equation outcome_i = (b̂_0) + error_i:

1 = 2.6 + error_1
error_1 = 1 − 2.6 = −1.6

As such => we have just calculated the deviance – i.e., the error

deviance = outcome_i − model_i

The error/deviance for a particular entity => the score predicted by the model for that entity
subtracted from the corresponding observed score

The line representing the mean can be
thought of as our model

- The dots are the observed
data

The diagram has a series of vertical lines
that connect each observed value to the
mean value

- These represent the
error/deviance of the model
for each entity

A negative number (e.g., -1.6) => shows that the model overestimates the actual value
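The worked example above can be checked directly in Python:

```python
# Model estimate and observed score from the example above
b0_hat = 2.6
outcome_1 = 1

# deviance = outcome_i - model_i
deviance = outcome_1 - b0_hat     # -1.6: negative => the model overestimates
```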

Estimating the Model Fit Overall

We cannot simply add the deviances => some errors are positive and others negative, and they would cancel out to a total of zero:

total error = sum of errors = Σ_(i=1)^n (outcome_i − model_i) = 0

The solution to this problem => square the errors:

sum of squared errors (SS) = Σ_(i=1)^n (outcome_i − model_i)^2

Specific Models

When thinking about a specific model – i.e., such as when the model is the mean – the general equation becomes:

Σ_(i=1)^n (outcome_i − model_i)^2 = Σ_(i=1)^n (x_i − x̄)^2

General Models

Think of the total error in terms of this general equation:

total error = Σ_(i=1)^n (observed_i − model_i)^2

This equation shows how the SS can be used to assess the total error in any model – not just
the mean

The SS is a good measure of the accuracy of the model – but it depends on the quantity of data collected (the more data points => the higher the SS)

- This problem is overcome by using the average error rather than the total

Computing the average error => divide the SS (i.e., total error) by the number of values (i.e.,
N) that we used to compute the total
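A short Python check of these two points – the SS grows with the amount of data, while the average error does not (scores made up):

```python
data = [1, 3, 4, 3, 2]
mean = sum(data) / len(data)                     # the model: b0 = 2.6

# SS = sum of (outcome_i - model_i)^2
ss = sum((x - mean) ** 2 for x in data)          # 5.2

# Duplicating the data leaves the fit unchanged but doubles the SS:
doubled = data * 2
mean2 = sum(doubled) / len(doubled)              # still 2.6
ss2 = sum((x - mean2) ** 2 for x in doubled)     # 10.4

# The average error does not suffer from this problem
avg_error = ss / len(data)                       # 1.04
avg_error2 = ss2 / len(doubled)                  # 1.04
```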

Estimating the Mean Error in the Population

Estimated by:

1. Divide by the degrees of freedom (df) – i.e., the number of scores used to compute
the total adjusted for the fact that we are trying to estimate the population value

mean squared error = SS / df = Σ_(i=1)^n (outcome_i − model_i)^2 / (N − 1)

This is a more general form of the equation for variance => the above equation can be easily
transformed into the one for variance:

mean squared error = SS / df = Σ_(i=1)^n (x_i − x̄)^2 / (N − 1)
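With the mean as the model, this mean squared error is exactly the sample variance – which can be verified against Python's statistics module (illustrative scores):

```python
import statistics

data = [1, 3, 4, 3, 2]
mean = sum(data) / len(data)

# mean squared error = SS / df, with df = N - 1
ss = sum((x - mean) ** 2 for x in data)
mse = ss / (len(data) - 1)                       # 5.2 / 4 = 1.3

# statistics.variance computes the sample variance with the same N - 1 divisor
assert abs(mse - statistics.variance(data)) < 1e-9
```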

Summary

The sum of squared errors (SS) and the mean squared error (i.e., the variance) => can be used to assess the fit of a model

- Large values relative to the model => indicate a lack of fit

SS => used to assess the total error in any model; a measure of the accuracy of a model

- Depends on the quantity of data collected
- The more data => the higher the SS

The mean squared error (MS) – or the variance – is the average error in the model in the
population



E is for Estimating Parameters


Equations for estimating parameters are based on the principle of minimizing error –
providing the parameter that has the least error given the data

The principle of minimizing the sum of squared errors (SS) – known as the method of least
squares or ordinary least squares (OLS)




The equation for the mean is designed to estimate the parameter that minimizes the error – i.e., the value that has the least error

The estimation equations find the parameter value that yields the lowest possible value of the SS

outcome_i = (b̂_0) + error_i

sum of squared errors (SS) = Σ_(i=1)^n (outcome_i − model_i)^2
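One way to see the least-squares principle at work is a numerical check (Python, made-up scores): over a grid of candidate values for b_0, none gives a smaller SS than the sample mean.

```python
data = [1, 3, 4, 3, 2]
mean = sum(data) / len(data)                     # 2.6

def ss(b0):
    """Sum of squared errors for a candidate parameter value b0."""
    return sum((x - b0) ** 2 for x in data)

# Candidate values around the mean, spaced 0.01 apart;
# the mean itself is among them and minimizes the SS
candidates = [mean + step / 100 for step in range(-200, 201)]
best = min(candidates, key=ss)
assert abs(best - mean) < 1e-9
```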

S is for Standard Error


The SD allows us to see how well the mean represents the sample data

The standard error – allows us to look at how representative the samples are of the
population of interest

When using a sample we calculate the average score – i.e., the sample mean – however, different samples will have different sample means

- This difference illustrates sampling variation – i.e., samples vary because they
contain different members of the population

Plotting sample means as a frequency distribution – or histogram (i.e., a graph of possible
values of the sample mean plotted against the number of samples that have a mean of that
value) – we would see the frequency of a given mean in the samples

- The end result is a distribution => known as a sampling distribution

Sampling Distribution

A sampling distribution – i.e., the frequency distribution of sample means from the same
population

The sampling distribution of the mean – tells us about the behavior of samples from the
population

- It is centered at the same value as the mean of the population

If our observed data are sample means => the standard deviation of these sample means
would tell us how widely spread (i.e., how representative) sample means are around their
average

The average of the sample means = the population mean

- The standard deviation of the sample means => tells us how widely sample means
are spread around the population mean
 Tells us whether sample means are typically representative of the population
mean

Standard Error of the Mean (SE)

The SD of sample means – i.e., the standard error (SE)

The central limit theorem – i.e., states that as samples get large (N > 30) => the sampling
distribution has:

(1) A normal distribution with a mean equal to the population mean
(2) A standard deviation of σ_x̄ = s / √N

Therefore – if the sample is large (N > 30) this equation can be used to approximate the
standard error (SE)

- Because it is the SD of the sampling distribution

When the sample is relatively small (< 30) – the sampling distribution is not normal

- It has a different shape – i.e., t-distribution

Summary

- The SE of the mean – i.e., the SD of sample means – is a measure of how representative of the population a sample mean is likely to be

- A large SE => a lot of variability between the means of different samples => the sample mean may not be representative of the population mean

- A small SE => indicates most sample means are similar to the population mean – i.e., the sample mean is likely to accurately reflect the population mean
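These points can be illustrated by simulation (Python; the population parameters are made up): draw many samples, and the SD of their means closely matches the s/√N approximation.

```python
import random
import statistics

random.seed(42)

# A made-up population (mean 100, SD 15)
population = [random.gauss(100, 15) for _ in range(100_000)]

# Draw many samples of size N and record each sample's mean
N = 50
sample_means = [
    statistics.fmean(random.sample(population, N)) for _ in range(2_000)
]

# The SD of the sample means is the standard error; compare it with
# the central-limit-theorem approximation s / sqrt(N)
empirical_se = statistics.stdev(sample_means)
approx_se = statistics.stdev(population) / N ** 0.5   # ~ 15 / sqrt(50) ~ 2.12
```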

I is for (Confidence) Interval
