Introduction to the Practice of Statistics
1.2 Displaying distributions with graphs
Statistical tools and ideas help us examine data in order to describe their main features. This
examination is called exploratory data analysis. Two basic strategies help us organize our
exploration of a set of data: begin by examining each variable by itself before moving on to
relationships among variables, and begin with graphs before adding numerical summaries of
specific aspects of the data.
Categorical variables: bar graphs and pie charts
The values of a categorical variable are labels for the categories, such as ‘yes’ and ‘no’. The
distribution of a categorical variable lists the categories and gives either the count or the
percent of cases that fall in each category. When a categorical variable has a large number of
distinct values, the less frequent values are often grouped into an 'other' category.
The categories in a bar graph can be put in any order, but you should always consider the
best way to order them. A bar graph of counts has the same shape as a bar graph of
percents of the same data. To make a pie chart, you must include all the categories that make
up the whole, so the percents for all categories must sum to 100%. This constraint makes bar
graphs more flexible.
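As a rough sketch of both graph types (assuming matplotlib is installed; the category labels and counts are invented purely for illustration):

```python
import matplotlib.pyplot as plt

# Invented category labels and counts, purely for illustration
categories = ["Yes", "No", "Undecided", "Other"]
counts = [120, 80, 30, 20]

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))

# Bar graph: the bars may be put in any order; heights show the counts
ax_bar.bar(categories, counts)
ax_bar.set_ylabel("Count")
ax_bar.set_title("Bar graph")

# Pie chart: the slices must cover all categories, so the percents sum to 100%
ax_pie.pie(counts, labels=categories, autopct="%.0f%%")
ax_pie.set_title("Pie chart")

plt.tight_layout()
plt.show()
```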
Quantitative variables: stemplots and histograms
A stemplot gives a quick picture of the shape of a distribution while including the actual
numerical values in the graph. Stemplots work best for a small number of observations that
are all greater than 0. When you wish to compare two related distributions, a back-to-back
stemplot with common stems is useful. The leaves on each side are ordered out from the
common stem.
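Stemplots are normally drawn by hand, but a minimal Python sketch (with invented data; it assumes positive integer observations split into a tens-digit stem and a ones-digit leaf) shows the idea:

```python
from collections import defaultdict

# Invented small data set of positive integer observations
data = [9, 12, 15, 21, 23, 23, 27, 31, 34, 40]

stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)   # stem = all but the final digit, leaf = final digit
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:>2} | {leaves}")
```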
A histogram breaks the range of values of a variable into classes and displays only the count
or percent of the observations that fall into each class. You can choose any convenient
number of classes, but you should choose classes of equal width. Histograms do not display
the actual values observed. Therefore, we prefer stemplots for small data sets. Large sets of
data are often reported in the form of frequency tables when it is not practical to publish the
individual observations.
Our eyes respond to the area of the bars in a histogram. Because the classes are all the
same width, area is determined by height and all classes are fairly presented. Too few
classes will give a ‘skyscraper’ graph, whereas too many will produce a ‘pancake’ graph.
Neither choice will give a good picture of the shape of the distribution. You must use your
judgement in choosing classes to display the shape.
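The effect of the number of classes is easy to see with software. The sketch below (random illustrative data, assuming NumPy and matplotlib are available) draws the same data with too few, a moderate number, and too many equal-width classes:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # illustrative data only

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [3, 12, 80]):
    ax.hist(data, bins=bins)        # 'bins' equal-width classes over the range of the data
    ax.set_title(f"{bins} classes")
    ax.set_xlabel("Value")
axes[0].set_ylabel("Count")
plt.tight_layout()
plt.show()
```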
Although histograms resemble bar graphs, their details and uses are distinct. A histogram
shows the distribution of counts or percents among the values of a single variable, whereas
a bar graph compares the counts or percents of different items. Bar graphs are drawn with
blank space between the bars to separate the items, whereas histograms are drawn with no
space, to indicate that all values of the variable are covered.
Examining distributions
Making a statistical graph is not an end in itself. The purpose of the graph is to help us
understand the data. Once you have displayed a distribution, you can see its important
features. Some things to look for in describing the shape are:
- does the distribution have one or several major peaks (modes)?
- is it approximately symmetric or is it skewed in one direction?
- are there any outliers?
1.3 Describing distributions with numbers
A brief description of the distribution of a quantitative variable should include its shape and
numbers describing its center and spread. To interpret measures of center and spread, and
to choose among the several measures, you must think about the shape of the distribution
and the meaning of the data. The numbers, like graphs, are aids to understanding, not "the
answer" in themselves.
Measuring center: the mean and the median
Numerical description of a distribution begins with a measure of its center or average. The
two common measures of center are the mean and the median. The mean is the "average
value" and the median is the "middle value". To find the mean x̄ of a set of observations, add
their values and divide by the number of observations: x̄ = (x1 + x2 + ... + xn)/n.
The median is the midpoint of a distribution: half the observations are smaller than the
median and the other half are larger than the median. The formula (n + 1)/2 gives the location
of the median in the ordered list of observations. The median is more resistant than the mean:
it is less influenced by outliers.
The mean and median of a symmetric distribution are close together. If the distribution is
exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the
mean is farther out in the long tail than is the median.
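A quick check with Python's statistics module (the numbers are invented) shows how a single extreme observation pulls the mean toward the long tail while leaving the median unchanged:

```python
from statistics import mean, median

data = [4, 5, 5, 6, 7, 8, 9]
print(mean(data), median(data))   # about 6.29 and 6

# Replace the largest value with an extreme one: the mean jumps, the median does not
data_with_outlier = [4, 5, 5, 6, 7, 8, 90]
print(mean(data_with_outlier), median(data_with_outlier))   # about 17.86 and 6
```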
Measuring spread: the quartiles
A measure of center alone can be misleading, so we are interested in the spread or
variability of variables as well as their centers. The simplest useful numerical description of a
distribution consists of both a measure of center and a measure of spread.
We can describe the spread or variability of a distribution by giving several percentiles. The
median, which is called the 50th percentile, divides the data in two. The upper quartile is the
median of the upper half of the data and the lower quartile is the median of the lower half of
the data. We can do a similar calculation for any percent: the pth percentile of a distribution
is the value such that p percent of the observations fall at or below it.
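The quartile rule above (each quartile is the median of half of the ordered data) can be written out directly as a small sketch; note that statistical software often uses slightly different interpolation rules, so its quartiles may differ a little from this hand-style calculation:

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as the medians of the lower and upper halves of the ordered data.
    When n is odd, the overall median is left out of both halves."""
    x = sorted(data)
    n = len(x)
    half = n // 2
    lower = x[:half]
    upper = x[half + 1:] if n % 2 else x[half:]
    return median(lower), median(upper)

data = [10, 15, 22, 29, 35, 40, 41, 50, 76]   # invented observations
q1, q3 = quartiles(data)
print(q1, q3)   # 18.5 and 45.5
```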
The five-number summary and boxplots
The five-number summary of a set of observations consists of the smallest observation, the
first quartile, the median, the third quartile, and the largest observation, written in order from
smallest to largest. In symbols, the five-number summary is:
Minimum Q1 M Q3 Maximum
A boxplot is a graph of the five-number summary:
- A central box spans the quartiles Q1 and Q3;
- A line in the box marks the median M;
- Lines extend from the box out to the smallest and largest observations (the whiskers).
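A minimal sketch (invented data, assuming NumPy and matplotlib) that prints the five-number summary and draws the corresponding boxplot; NumPy's percentile interpolation may give quartiles slightly different from the median-of-halves rule above:

```python
import numpy as np
import matplotlib.pyplot as plt

data = [10, 15, 22, 29, 35, 40, 41, 50, 76]   # invented observations

# Min, Q1, M, Q3, Max; NumPy interpolates percentiles, so Q1 and Q3 may differ
# slightly from the median-of-halves rule
print(np.percentile(data, [0, 25, 50, 75, 100]))

plt.boxplot(data, whis=(0, 100))   # whiskers run out to the smallest and largest observations
plt.ylabel("Value")
plt.show()
```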
The 1.5 x IQR rule for suspected outliers
The smallest and largest observations are extremes that do not describe the spread of the
majority of the data. The distance between the quartiles is a more resistant measure of
spread than the range. This distance is called the interquartile range: IQR = Q3 - Q1.
However, no single numerical measure of spread, such as IQR, is very useful for describing
skewed distributions. The two sides of a skewed distribution have different spreads, so one
number can’t summarize them. The interquartile range is mainly used as the basis for a rule
of thumb for identifying suspected outliers: if an observation falls more than 1.5 x IQR above
the third quartile or below the first quartile, it is a suspected outlier.
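A short sketch of the rule (invented data; quartiles from NumPy's default percentile interpolation):

```python
import numpy as np

data = np.array([10, 15, 22, 29, 35, 40, 41, 50, 120])   # invented data with one large value

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the 1.5 x IQR "fences"

suspected = data[(data < low) | (data > high)]
print(f"IQR = {iqr}, fences = ({low}, {high}), suspected outliers: {suspected}")
```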
Measuring spread: the standard deviation
The five-number summary is not the most common numerical description of a distribution.
That distinction belongs to the combination of the mean to measure center and the standard
deviation to measure spread, or variability. The standard deviation measures spread by
looking at how far the observations are from their mean.
The variance s² of a set of observations is essentially the average of the squared deviations of
the observations from their mean, except that the sum of squared deviations is divided by
n - 1 rather than by n. The standard deviation s is the square root of the variance s².
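Written out directly (with invented observations), and checked against Python's statistics module, which uses the same n - 1 formulas:

```python
from statistics import mean, stdev, variance

data = [1792, 1666, 1362, 1614, 1460, 1867, 1439]   # invented observations

xbar = mean(data)
s2 = sum((x - xbar) ** 2 for x in data) / (len(data) - 1)   # variance: divisor is n - 1
s = s2 ** 0.5                                               # standard deviation

print(s2, variance(data))   # the hand calculation agrees with the library
print(s, stdev(data))
```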
Properties of the standard deviation:
- s measures spread about the mean and should be used only when the mean is
chosen as the measure of center.
- s = 0 only when there is no spread. This happens only when all observations have
the same value. Otherwise, s > 0. As the observations become more spread out
about their mean, s gets larger.
- s, like the mean, is not resistant. A few outliers can make s very large.
The use of squared deviations renders s even more sensitive than the mean to a few
extreme observations.
Choosing measures of center and spread
How do we choose between the five-number summary and the mean and standard deviation
to describe the center and spread of a distribution? Because the two sides of a strongly
skewed distribution have different spreads, no single number such as s describes the spread
well. The five-number summary, with its two quartiles and two extremes, does a better job.
Remember that a graph gives the best overall picture of a distribution. Numerical summaries
do not disclose the presence of multiple modes or gaps, for example. Always plot your data.
Changing the unit of measurement
The same variable can be recorded in different units of measurement. Fortunately, it is easy
to convert numerical descriptions of a distribution from one unit of measurement to another.
This is true because a change in the measurement unit is a linear transformation of the
measurements. A linear transformation changes the original variable x into the new variable
x_new given by an equation of the form x_new = a + bx. Adding the constant a shifts all values of
x upward or downward by the same amount. Multiplying by the positive constant b changes
the size of the unit of measurement.
Linear transformations do not change the shape of a distribution, although the center and
spread will change. Fortunately, the changes follow a simple pattern.
- Multiplying each observation by a positive number b multiplies both measures of
center and measures of spread by b.
- Adding the same number a to each observation adds a to measures of center and to
quartiles and other percentiles but does not change measures of spread.
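This pattern is easy to verify numerically. The sketch below (invented measurements) applies the familiar Celsius-to-Fahrenheit conversion, a linear transformation x_new = 32 + 1.8x:

```python
from statistics import mean, stdev

celsius = [12.0, 15.5, 18.0, 21.5, 25.0]        # invented measurements
fahrenheit = [32 + 1.8 * x for x in celsius]    # linear transformation x_new = 32 + 1.8x

print(mean(celsius), stdev(celsius))
print(mean(fahrenheit), stdev(fahrenheit))
# mean_F = 32 + 1.8 * mean_C, but stdev_F = 1.8 * stdev_C: the shift 32 does not affect spread
```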
1.4 Density Curves and Normal Distributions
We now have a clear strategy for exploring data on a single quantitative variable:
1. Always plot your data: make a graph, usually a stemplot or a histogram;
2. Look for the overall pattern and for striking deviations such as outliers;
3. Calculate an appropriate numerical summary to briefly describe center and spread.
Technology has expanded the set of graphs that we can choose from for step 1. Histograms
can be made by hand, but software can go further: clever algorithms can fit a smooth curve to
the data, in addition to or instead of a histogram, describing the distribution in a way that is
not feasible by hand. The curves used are called density curves. A smooth density curve is
an idealization that gives the overall pattern of the data but ignores minor irregularities.
A density curve is a curve that
- is always on or above the horizontal axis;
- has area exactly one underneath it.
A density curve describes the overall pattern of a distribution. The area under the curve and
above any range of values is the proportion of all observations that fall in that range. Density
curves, like distributions, come in many shapes. A density curve of an appropriate shape is
often an adequate description of the overall pattern of a distribution. Outliers, which are
deviations from the overall pattern, are not described by the curve.
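As a numeric illustration (the triangular density below is an arbitrary choice, and the crude grid integration is only for checking areas), the total area under a density curve is 1, and the area over a range gives the proportion of observations in that range:

```python
import numpy as np

# An arbitrary triangular density on [0, 2]: f(x) = x for x <= 1, and 2 - x for x > 1
def f(x):
    return np.where(x <= 1.0, x, 2.0 - x)

# Crude numerical integration on a fine grid of midpoints
def area(lo, hi, n=100_000):
    width = (hi - lo) / n
    mids = lo + (np.arange(n) + 0.5) * width
    return float(np.sum(f(mids)) * width)

print(area(0, 2))     # total area under the density curve: close to 1
print(area(0, 0.5))   # proportion of observations between 0 and 0.5: close to 0.125
```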
Measuring center and spread for density curves
Our measures of center and spread apply to density curves as well as to actual sets of
observations, but only some of these measures are easily seen from the curve.
- Mode: a peak point of the curve;
- Median: the equal-areas point, with half the total area under the curve on each side;
- Quartiles: the points that divide the area under the curve into quarters, located as
accurately as possible by eye;
- IQR: the distance between the first and third quartiles;
- Mean: the balance point, at which the curve would balance if it were made of solid material.
The median and mean are the same for a symmetric density curve. They both lie at the
center of the curve. The mean of a skewed curve is pulled away from the median in the
direction of the long tail.
A density curve is an idealized description of a distribution of data. Density curves can be
exactly symmetric, whereas the histogram can be only approximately symmetric. We
therefore need to distinguish between the mean and standard deviation of the density curve
and the numbers x̄ and s computed from the actual observations. The usual notation for the
mean of an idealized distribution is 𝜇. We write the standard deviation of a density curve as
𝜎.
Normal distributions
One particularly important class of density curves is the class of Normal curves, which
describe Normal distributions. They are symmetric, unimodal, and bell-shaped. The mean is
located at the center of the symmetric curve and is the same as the median. The mean 𝜇 and
standard deviation 𝜎 completely determine the shape of a Normal distribution, and 𝜎 can be
located by eye: the points at which the curvature of the curve changes lie at distance 𝜎 on
either side of the mean. Why are Normal distributions important in statistics?
1. Normal distributions are good descriptions for some distributions of real data;
2. Normal distributions are good approximations to the results of many kinds of chance
outcomes, such as tossing a coin many times;
3. Many statistical inference procedures based on Normal distributions work well for
other roughly symmetric distributions.
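The Normal density curve with mean 𝜇 and standard deviation 𝜎 is f(x) = (1/(𝜎√(2π))) · exp(−(x − 𝜇)²/(2𝜎²)). The sketch below (assuming NumPy and matplotlib; 𝜇 and 𝜎 are chosen arbitrarily) plots this curve and marks the change-of-curvature points at 𝜇 − 𝜎 and 𝜇 + 𝜎:

```python
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0.0, 1.0   # arbitrary choice of mean and standard deviation

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)
density = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

plt.plot(x, density)
# The curvature changes (concave down to concave up) at mu - sigma and mu + sigma
plt.axvline(mu - sigma, linestyle="--")
plt.axvline(mu + sigma, linestyle="--")
plt.title("Normal density curve; dashed lines at mu - sigma and mu + sigma")
plt.show()
```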
The 68-95-99.7 rule
Although there are many Normal curves, they all have common properties. The 68-95-99.7
rule is one of the most important. In the Normal distribution with mean 𝜇 and standard
deviation 𝜎