Applied Statistics - Summary Block 4,
Academic Year 09-10
Chapter 1 – Examining distributions
Statistics is a way to get information from data by collecting, analyzing and
interpreting those data, in order to gain insight into real-world phenomena and
to assist in decision-making processes.
The overall pattern of a distribution can be described by its center, spread and
shape:
[Figure: three example shapes: skewed to the left, symmetric, skewed to the right]
Different ways to graph the distribution of a variable are:
• Stemplot; A stemplot separates each observation into a stem (listed from low
to high) and a leaf.
• Timeplot; A timeplot plots each observation against the time at which it was
measured.
• Pie chart; A pie chart shows the relative frequency of each category as a
slice of a circle; the slices together make up 100%.
• Barchart; A barchart shows the frequency of each category as a bar; unlike in
a pie chart, the bars don’t have to add up to 100%.
• Pareto chart; A Pareto chart is a barchart with the bars sorted from highest
to lowest frequency.
• Histogram; A histogram divides the range of a quantitative variable into
classes of equal width and shows how many observations fall in each class.
To interpret these graphs, look at the overall pattern, which you can describe
by its center, spread and shape. Also watch out for outliers in the data.
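As an illustration of how a stemplot is built, here is a minimal Python sketch.
It assumes non-negative whole numbers, with the tens as stem and the units as
leaf; the function name stemplot and the example data are made up for this
illustration and do not come from the course material.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stemplot: stem = tens digit(s), leaf = units digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # List every stem from low to high, including stems without leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {leaf_text}")
    return lines

# Made-up example data
scores = [12, 15, 21, 23, 23, 28, 34, 41, 47, 49]
for line in stemplot(scores):
    print(line)
```

Read sideways, the rows of leaves give the same impression of shape as a
histogram with classes of width 10.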
Describing distributions with numbers:
1. Location; Mean (average) or median (odd number of observations: the middle
observation/even number of observations: the average of the two middle
observations).
2. Spread; Range (maximum-minimum), variance* (average squared deviation
around the mean), standard deviation (square root of the variance) and the
interquartile range** (IQR=Q3-Q1).
*= $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
**= Q1 is the median of the first half of the ordered observations, Q3 is the
median of the second half, and the IQR is the difference between them.
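The measures of location and spread can be checked with a short Python sketch.
The data are made up, and the helper names sample_variance and quartiles are
chosen for this illustration only. The quartiles follow the rule given above
(medians of the two halves); the sketch assumes that the middle observation is
left out of both halves when n is odd, and that there are at least two
observations.

```python
import statistics
from math import sqrt

def sample_variance(values):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper half of the sorted data."""
    xs = sorted(values)
    half = len(xs) // 2          # middle observation is excluded when n is odd
    return statistics.median(xs[:half]), statistics.median(xs[len(xs) - half:])

# Made-up example data
data = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(data)
median = statistics.median(data)
data_range = max(data) - min(data)
variance = sample_variance(data)
std_dev = sqrt(variance)
q1, q3 = quartiles(data)
iqr = q3 - q1

print(mean, median, data_range, variance, std_dev, q1, q3, iqr)
```

Note that software packages and calculators may use slightly different
quartile rules, so their Q1 and Q3 can differ a little from this definition.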
A five number summary gives a quick overview of the distribution. It shows the
minimum, the first quartile (Q1), the median, the third quartile (Q3) and the
maximum. It can be shown in a box plot.
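A five-number summary could be computed along the following lines. This is a
sketch with made-up data; the function name five_number_summary is an
assumption, and the quartiles are taken as the medians of the lower and upper
half of the sorted observations (middle observation excluded when n is odd).

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two observations."""
    xs = sorted(values)
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])
    q3 = statistics.median(xs[len(xs) - half:])
    return xs[0], q1, statistics.median(xs), q3, xs[-1]

# Made-up example data
observations = [15, 1, 9, 4, 12, 3, 8, 6]
print(five_number_summary(observations))
```

In a box plot, the box runs from Q1 to Q3 with a line at the median, so these
five numbers are all that is needed to draw it.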
A density curve is a mathematical description of a distribution. It is an
idealized description of the data. The most frequently used distribution is the
normal distribution, which is symmetric and unimodal. Calculations about this
distribution can be made with the z-score (z=(x-μ)/σ) or on your graphing
calculator with the function normalcdf(…).
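For readers without a graphing calculator, the same calculation can be
sketched in Python with the standard library class statistics.NormalDist. The
function below is only named after the calculator function; its argument order
(lower, upper, mu, sigma) and the example values μ=100 and σ=15 are assumptions
made for this illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Probability that a normal variable falls between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Made-up example: a normal distribution with mean 100 and standard deviation 15
print(z_score(130, 100, 15))          # z-score of the value 130
print(normalcdf(85, 115, 100, 15))    # share of values within one sd of the mean
print(normalcdf(-1, 1))               # same share, computed from z-scores
```

The last two lines give the same probability, which shows why standardizing
with the z-score lets every normal distribution be handled with the standard
normal distribution.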
To determine whether a distribution is normal, we look at the normal quantile
plot, which plots the observed values against the z-scores that a normal
distribution would give for them. The points lie close to a straight line if
the distribution is approximately normal.
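The points of a normal quantile plot could be computed as in the sketch below.
It assumes the plotting positions (i - 0.5)/n for the i-th smallest
observation, which is one common convention; textbooks and software may use a
slightly different one. The function name and the data are made up for this
illustration.

```python
from statistics import NormalDist

def normal_quantile_points(values):
    """Pair each sorted observation with the z-score a normal distribution
    would give at the same position in the ordering."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(xs, start=1)]

# Made-up example data
measurements = [3.1, 2.0, 5.5, 4.2, 3.8]
for z, x in normal_quantile_points(measurements):
    print(f"z = {z:6.3f}   observed = {x}")
```

Plotting the observed values against these z-scores gives the normal quantile
plot; clear bends or isolated points away from the line indicate skewness or
outliers.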
Chapter 2 – Examining relationships
A scatterplot shows the relationship between two quantitative variables
measured on the same individuals. Each observation is depicted as one point
with, as x-coordinate, the value of one variable and, as y-coordinate, the value of
the other variable. We can use a scatterplot to determine the shape of the
relationship (line, cluster etc.).
The correlation r measures the direction and strength of
the linear relationship between two quantitative
variables. The correlation is the mean product of the
z-scores of x and y. In a positive relationship, high
values of x tend to go with high values of y (and low x
with low y); in a negative relationship, low values of x
tend to go with high values of y (and high x with low y).
The correlation only captures linear relationships and it
is always between -1 and 1.
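The definition of the correlation as the mean product of z-scores can be
written out in Python as follows. This sketch assumes that the "mean" divides
by n - 1, in line with the sample standard deviation used for the z-scores;
the function name and the data are made up for this illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Made-up example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(correlation(x, y), 4))                 # moderately strong, positive
print(round(correlation(x, [10, 8, 6, 4, 2]), 4))  # points exactly on a falling line
```

Because z-scores have no units, r does not change when x or y is measured in
different units.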
A regression line is a straight line that describes how
a response variable y changes as an explanatory
variable x changes. The most commonly used regression
line is the least-squares regression line, which is the straight line y=b0+b1x that
minimizes the sum of the squares of the vertical distances of the observed points
from the line. The slope b1 gives the change in the predicted y when x increases
by one unit. The formula is b1=r*(sy/sx), where r is the correlation, sy the
standard deviation of y and sx the standard deviation of x. The intercept b0 is
the predicted value of y if x=0.
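The slope formula b1=r*(sy/sx) can be tried out with the sketch below. The
intercept is computed as b0 = mean(y) - b1*mean(x), which follows from the fact
that the least-squares line passes through the point of the two means; this
formula is not stated in the summary itself. The function names and the data
are made up for this illustration.

```python
from math import sqrt
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores divided by n - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

def least_squares_line(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1*x."""
    b1 = correlation(xs, ys) * stdev(ys) / stdev(xs)   # slope: r * (sy / sx)
    b0 = mean(ys) - b1 * mean(xs)                      # line passes through the means
    return b0, b1

# Made-up example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
print(f"predicted y = {b0:.2f} + {b1:.2f} * x")
print(f"prediction at x = 6: {b0 + b1 * 6:.2f}")
```

Predictions far outside the observed range of x (extrapolation) should be
treated with caution, since the straight-line pattern may not continue there.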
A residual is the difference between an observed value and the value predicted
by the regression line (residual = observed y - predicted y). A residual plot is
a scatterplot of the regression residuals against the explanatory variable.
Residual plots help us assess the fit of a regression line.
In the first residual plot we see that the regression
line fits quite well, but the second residual plot has a
curved pattern, which suggests a curved, non-linear
relationship instead of a straight-line relationship.
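A curved residual pattern like the one described above can be reproduced with
made-up data. In the sketch below y is deliberately generated as x squared, so
a straight line is the wrong model; the function names are hypothetical. The
slope is computed as the sum of products of deviations divided by the sum of
squared x-deviations, which is algebraically the same as r*(sy/sx).

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

def residuals(xs, ys):
    """Observed y minus the y predicted by the least-squares line."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Made-up curved data: y = x squared
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
res = residuals(x, y)
for xi, ri in zip(x, res):
    print(f"x = {xi}   residual = {ri:+.1f}")
```

The residuals are positive at both ends and negative in the middle, which is
the curved pattern that signals a non-linear relationship. The residuals of a
least-squares line always add up to zero, so only their pattern is informative.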
Outliers can be influential observations, but an influential observation does
not need to have a large residual, because it pulls the regression line toward
itself.
Differences between correlation and regression (both measures are sensitive to
outliers):
                 Correlation                     Regression
Goal:            Measure of the strength and     Prediction of one variable from
                 direction of the relationship   another using a straight line.
                 between two quantitative
                 variables.
Role of the      Both variables have the same    There is one response variable
variables:       role.                           y and one explanatory variable
                                                 x.