Applied Statistics - Summary Block 4,
Academic Year 09-10
Chapter 1 – Examining distributions
Statistics is a way to get information from data: by collecting, analyzing and
interpreting data, we gain insight into real-world phenomena and support
decision-making processes.
The overall pattern of a distribution can be described by its center, spread and
shape:
[Figure: three density curves. Panels: Skewed to the left, Symmetric, Skewed to the right]
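Different ways to graph the distribution of a variable are:
• Stemplot; A stemplot separates each observation into a stem (all but the last
digit, listed from low to high) and a leaf (the last digit).
• Timeplot; A timeplot plots each observation against the time at which it was
measured.
• Pie chart; A pie chart shows relative frequencies as slices of a circle that
together make up 100%.
• Barchart; A barchart shows the frequency of each category as a bar; unlike in a
pie chart, the percentages don’t have to add up to 100%.
• Pareto chart; A Pareto chart is a barchart with the bars sorted from the highest
to the lowest frequency.
• Histogram; A histogram shows the distribution of a quantitative variable by
grouping the values into classes and drawing a bar for the count (or percentage)
in each class.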
To interpret these graphs, look at the overall pattern, which you can describe by
its center, spread and shape. Also watch for outliers: observations that fall
outside the overall pattern.
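As an illustration of the stemplot described in the list above, here is a minimal Python sketch. It assumes non-negative whole numbers, with the tens as stem and the units digit as leaf; the function name `stemplot` and the data are made up for this example.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stemplot for non-negative integers.

    Assumption: stem = all digits except the last, leaf = last digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # List every stem from low to high, including stems without leaves,
    # so that gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
    return lines

data = [12, 15, 21, 23, 23, 38, 40, 41]  # made-up observations
for line in stemplot(data):
    print(line)
```

Reading the printed rows from top to bottom shows the shape of the distribution, much like a histogram turned on its side.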
Describing distributions with numbers:
1. Location; Mean (average) or median (odd number of observations: the middle
observation/even number of observations: the average of the two middle
observations).
2. Spread; Range (maximum − minimum), variance* (average squared deviation
around the mean), standard deviation (square root of the variance) and the
interquartile range** (IQR = Q3 − Q1).
* $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
** Q1 is the median of the first half of the ordered observations, Q3 is the
median of the second half, and the IQR is the difference between them.
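The measures of location and spread listed above can be sketched in Python with the standard `statistics` module. The helper `quartiles` and the data are made up for this example. One assumption is needed that the notes do not spell out: with an odd number of observations, the middle observation is left out of both halves.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper half.

    Assumption: for an odd number of observations the middle one
    belongs to neither half. Needs at least two observations.
    """
    xs = sorted(values)
    half = len(xs) // 2
    return statistics.median(xs[:half]), statistics.median(xs[-half:])

data = [2, 4, 4, 5, 7, 9, 11]  # made-up observations

mean = statistics.mean(data)
median = statistics.median(data)
value_range = max(data) - min(data)
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance
q1, q3 = quartiles(data)
iqr = q3 - q1

print(mean, median, value_range, variance, std_dev, iqr)
```

Note that `statistics.variance` uses the divisor n − 1, which matches the formula for s² given above.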
A five number summary gives a quick overview of the distribution. It shows the
minimum, the first quartile (Q1), the median, the third quartile (Q3) and the
maximum. It can be shown in a box plot.
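A five-number summary could be computed along the following lines. This is only a sketch: the function name `five_number_summary` and the data are made up, and the quartiles are taken as the medians of the lower and upper half of the ordered observations (with the middle observation left out when the count is odd, which is an assumption).

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper half;
    for an odd count the middle observation belongs to neither half.
    """
    xs = sorted(values)
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])
    q3 = statistics.median(xs[-half:])
    return xs[0], q1, statistics.median(xs), q3, xs[-1]

data = [1, 3, 4, 6, 8, 10, 12, 15]  # made-up observations
print(five_number_summary(data))
```

In a box plot, the box runs from Q1 to Q3 with a line at the median, and the whiskers extend to the minimum and maximum.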
A density curve is a mathematical description of a distribution. It is an
idealized description of the data. The most frequently used distribution is the
normal distribution, which is symmetric and unimodal. Calculations about this
distribution can be made with the z-score (z=(x-μ)/σ) or with your graphing
calculator, using the function normalcdf(…).
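For readers without a graphing calculator, the same calculations can be sketched in Python with `statistics.NormalDist`. The helper below is named `normalcdf` only to echo the calculator function; its argument order (lower bound, upper bound, μ, σ) is an assumption, and the numbers are made up.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: the number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under the normal curve between lower and upper.

    Hypothetical stand-in for the calculator function of the same name.
    """
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

mu, sigma = 100, 15  # made-up mean and standard deviation

z = z_score(130, mu, sigma)                       # 2 standard deviations above the mean
p_below = normalcdf(float("-inf"), 130, mu, sigma)  # about 0.977
p_within_one_sd = normalcdf(85, 115, mu, sigma)   # about 0.68

print(z, round(p_below, 4), round(p_within_one_sd, 4))
```

Because the z-score standardizes x, `normalcdf(float("-inf"), 130, 100, 15)` gives the same area as `normalcdf(float("-inf"), 2)` on the standard normal curve.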
To determine whether a distribution is normal, we look at the normal quantile
plot, which plots the observed values against their corresponding z-scores. The
points lie close to a straight line if the distribution is normal.
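The points of a normal quantile plot can be computed as in the sketch below. It assumes that the i-th smallest of n observations is paired with the z-score of the (i − 0.5)/n quantile; this plotting position is one common convention, and software packages differ on it. The function name and the data are made up.

```python
from statistics import NormalDist

def normal_quantile_points(values):
    """Return (z-score, observed value) pairs for a normal quantile plot.

    Assumption: the i-th smallest of n values (i = 1..n) is matched with
    the standard normal quantile at probability (i - 0.5) / n.
    """
    xs = sorted(values)
    n = len(xs)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [(standard_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

data = [5.6, 4.1, 7.3, 5.0, 6.2]  # made-up observations
for z, x in normal_quantile_points(data):
    print(f"{z:6.2f}  {x}")
```

Plotting these pairs with the z-scores on the horizontal axis gives the normal quantile plot; points far from a straight line indicate a departure from normality.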
Chapter 2 – Examining relationships
A scatterplot shows the relationship between two quantitative variables
measured on the same individuals. Each observation is depicted as one point
with, as x-coordinate, the value of one variable and, as y-coordinate, the value of
the other variable. We can use a scatterplot to determine the shape of the
relationship (line, cluster etc.).
The correlation r measures the direction and strength of the linear relationship
between two quantitative variables. The correlation is the mean product of the
z-scores of x and y (with divisor n − 1). In a positive relationship, high values
of x tend to go with high values of y (and low x with low y); in a negative
relationship, high values of x tend to go with low values of y (and low x with
high y).
The correlation only captures linear relationships and it is always between -1
and 1.
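The description of the correlation as a mean product of z-scores can be sketched in Python as follows. The sketch assumes the sample version with divisor n − 1, both for the standard deviations and for the mean of the products; the function name and the data are made up.

```python
import statistics

def correlation(xs, ys):
    """Correlation as the mean product of z-scores (divisor n - 1)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
print(round(correlation(x, y), 4))
```

Points that lie exactly on a rising straight line give a correlation of 1, and points exactly on a falling line give −1.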
A regression line is a straight line that describes how a response variable y
changes as an explanatory variable x changes. The most commonly used regression
line is the least-squares regression line, which is the straight line y=b0+b1x that
minimizes the sum of the squares of the vertical distances of the observed points
from the line. The slope b1 gives the change in the predicted y when x increases
by one unit. The formula is b1=r*(sy/sx), where r is the correlation, sy the standard
deviation of y and sx the standard deviation of x. The intercept b0 is the predicted
value of y if x=0; it equals b0=ȳ-b1*x̄, where x̄ and ȳ are the means of x and y.
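As a sketch of the slope formula b1=r*(sy/sx), the Python below computes a least-squares line for made-up data. The intercept is obtained from the standard relation b0 = ȳ − b1·x̄, which follows from the fact that the least-squares line passes through the point (x̄, ȳ). The function name is made up.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # Correlation: mean product of the z-scores, with divisor n - 1.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x         # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

x = [1, 2, 3, 4, 5]  # made-up data
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
predicted_at_6 = b0 + b1 * 6
print(round(b0, 2), round(b1, 2), round(predicted_at_6, 2))
```

For this data the slope is 0.6, so the predicted y rises by 0.6 for every one-unit increase in x.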
A residual is the difference between an observed value and the value predicted
by the regression line. A residual plot is a scatterplot of the regression residuals
against the explanatory variable. Residual plots help us assess the fit of a
regression line.
In the first residual plot we see that the regression line fits quite well, but the
second residual plot has a curved pattern, which suggests a curved, non-linear
relationship instead of a straight line.
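To make the idea of a curved residual pattern concrete, the sketch below fits a least-squares line to made-up data that follow y = x² and lists the residuals. The function names are made up; the slope is computed as the sum of products of deviations divided by the sum of squared x-deviations, which is algebraically the same as r*(sy/sx).

```python
import statistics

def fit_line(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1*x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residuals(xs, ys):
    """Observed y minus the y predicted by the least-squares line."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

x = [1, 2, 3, 4, 5]
y_curved = [1, 4, 9, 16, 25]  # made-up data on the curve y = x**2

for xi, res in zip(x, residuals(x, y_curved)):
    print(xi, round(res, 2))
```

The residuals are positive at both ends and negative in the middle. Plotted against x they form a U-shape, which is the kind of curved pattern that signals a straight line is not an adequate description.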
Outliers can be influential observations, but an influential observation does not
need to have a large residual, because it pulls the regression line toward itself.
Differences between correlation and regression (both measures are sensitive to
outliers):
                   Correlation                       Regression
Goal:              Measure of the strength and       Prediction of one variable from
                   direction of the relationship     another using a straight line.
                   between two quantitative
                   variables.
Role of the        Both variables have the same      There is one response variable y
variables:         role.                             and one explanatory variable x.