The practice of statistics in the life sciences
Chapter 1: Picturing Distributions with Graphs
Individuals and variables
Individuals are the objects (or units) described by a set of data. Individuals may be people, but they
may also be animals, plants, or things.
A variable is any one characteristic of an individual. A variable can take different values for different
individuals.
Population = the entire group of individuals about which we want information.
Identifying categorical and quantitative variables
A categorical variable places an individual into one of several groups or categories. Ordinal data can
be ranked, but they are not true quantitative variables, because the intervals between consecutive
ranks are often not identical.
A quantitative variable takes numerical values for which arithmetic operations such as adding and
averaging make sense. The values of a quantitative variable are usually recorded in a unit of
measurement such as seconds or kilograms.
Example:
Consider the percent of obese individuals in each state. The individuals studied are the 50 states,
not the people. Each state provides a meaningful numerical value. The variable here is “percent of
the population who are obese,” and it is a quantitative variable.
Exploring data
1. Begin by examining each variable by itself, then move on to the relationships among the variables.
2. Start with a graph or graphs, then add numerical summaries of specific aspects of the data.
The proper choice of graph depends on the nature of the variable. To examine a single variable, we
usually want to display its distribution.
The distribution of a variable tells us what values it takes and how often it takes these values. The
values of a categorical variable are labels for the categories. The distribution of a categorical
variable lists the categories and gives either the count or the percent of individuals that fall in each
category.
Counts are also sometimes referred to as frequencies, and percents as relative frequencies. The
percents should add to 100% or, because each percent in the table is rounded to the nearest
integer, very nearly 100%. Roundoff errors don’t point to mistakes in our work, just to the effect of
rounding off results.
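As a minimal sketch of the idea above, the following Python snippet computes counts and percents for a categorical variable and shows how rounded percents can miss 100%. The flower-colour data and the function name `categorical_distribution` are made up for illustration and do not come from the textbook.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Hypothetical data: flower colour of 6 plants (invented for this example).
colours = ["red", "red", "white", "white", "pink", "pink"]
dist = categorical_distribution(colours)

for cat, (n, pct) in sorted(dist.items()):
    print(f"{cat:>5}: count = {n}, percent = {pct:.0f}%")

# The exact percents add to 100, but the rounded ones need not.
exact_total = sum(pct for _, pct in dist.values())
rounded_total = sum(round(pct) for _, pct in dist.values())
print(f"exact total = {exact_total:.1f}, rounded total = {rounded_total}")
```

Here each category holds one third of the individuals, so each rounded percent is 33 and the rounded total is 99 rather than 100. This is roundoff error, not a mistake in the counting.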
Charts and graphs:
- A pie chart can represent only one variable in one group at a time and must include all the
categories that make up a whole.
- Bar graphs are particularly adept at pointing out the order and the relative importance of the
different categories.
o The bars can be sorted by height, so that the tallest bar appears first, followed by the
second-tallest bar, and so on. Alternatively, the bars could be sorted alphabetically.
o Bar graphs are more flexible than pie charts: they can be used to compare groups and do
not necessarily display all possible outcomes of a variable.
- In a stacked bar graph, the individuals in each group who did something and those who did not
together make up 100% of that group.
o It is important to understand this in order to interpret the data correctly. One group can
represent a much smaller share of the whole population while a larger percent of that
smaller group has the characteristic. Such a group can have the highest rate of the
characteristic without representing a large fraction of all individuals.
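The difference between a rate within a group and a share of the whole can be checked with a little arithmetic. The sketch below uses invented group sizes and counts. It reads "a large fraction of all individuals" as "a large fraction of all individuals who have the characteristic", which is an interpretation and not something the notes state.

```python
# Hypothetical counts, made up for illustration:
# group -> (group size, number in the group with the trait)
groups = {
    "young": (1000, 100),
    "middle": (800, 120),
    "old": (100, 40),
}

total_with_trait = sum(with_trait for _, with_trait in groups.values())

# Rate: percent of each group that has the trait (what one stacked bar shows).
rates = {g: 100 * k / n for g, (n, k) in groups.items()}
# Share: percent of all individuals with the trait who belong to each group.
shares = {g: 100 * k / total_with_trait for g, (n, k) in groups.items()}

for g in groups:
    print(f"{g:>6}: rate = {rates[g]:.1f}%, share = {shares[g]:.1f}%")
```

With these numbers the "old" group has the highest rate (40%) but the smallest share of all individuals with the trait, because the group itself is small.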
Quantitative variables: histograms
The distribution of a variable tells us what values the variable takes and how often it takes these
values. A graph of the distribution is often easier to interpret if nearby values are grouped together.
The most common graph of the distribution of one quantitative variable is a histogram, a summary
graph of a single variable. Although histograms resemble bar graphs in some aspects, their details
and uses are very different: a histogram displays the distribution of one quantitative variable.
To make a histogram of the distribution of this variable:
Step 1: Choose the classes
The range of values that the quantitative variable takes is divided into equal-size intervals, or classes.
These classes make up the horizontal axis. If the data range from 9.4 to 22.8 feet, you could use
these classes:
9.0 < individuals with body length ≤ 11.0
11.0 < individuals with body length ≤ 13.0
⁞
21.0 < individuals with body length ≤ 23.0
Choosing, instead, to include the lower bound and exclude the upper bound would also have been a
valid option. What matters is specifying the classes precisely so that each individual falls into exactly
one class. You can explain the nature of the class boundaries in the legend accompanying your
histogram. With the classes above, an observation of exactly 11.0 falls in the first class.
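The boundary rule can be made precise in a few lines of Python. This is only a sketch of the "exclude the lower bound, include the upper bound" convention used above. The function names `class_index` and `class_bounds` are hypothetical, and the defaults (start 9.0, width 2.0) are taken from the example classes.

```python
import math

def class_index(value, start=9.0, width=2.0):
    """Index i of the class (start + i*width, start + (i+1)*width] that contains value."""
    if value <= start:
        raise ValueError("value lies at or below the lower bound of the first class")
    # ceil puts a value that sits exactly on an upper bound into the lower class.
    return math.ceil((value - start) / width) - 1

def class_bounds(index, start=9.0, width=2.0):
    """Lower (excluded) and upper (included) bound of class number index."""
    return (start + index * width, start + (index + 1) * width)

for x in (9.4, 11.0, 11.1, 22.8):
    low, high = class_bounds(class_index(x))
    print(f"{x} falls in {low} < x <= {high}")
```

Every value lands in exactly one class. Including the lower bound and excluding the upper bound instead would mean using `math.floor` rather than `math.ceil`, and 11.0 would then move to the second class.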
There is no single right choice of the classes in a histogram. Too few classes will give a “skyscraper”
graph and too many classes will produce a “pancake” graph. Try starting with 5 to 10 classes, then
refine your class choice. Software will choose the classes for you. In SPSS, the function one-variable
statistical calculator allows you to change the number of classes.
Step 2: Count the individuals
Count the number of individuals in each class. Check that the counts add up to the number of
individuals in the data and that the percents add to 100, up to roundoff error.
Step 3: Draw the histogram
- Mark the scale for the variable whose distribution you are displaying on the horizontal axis. The
scale runs the span of the classes we chose. The horizontal axis represents a continuum of values,
and therefore the histogram does not leave any horizontal space between the bars unless a
class is empty (in which case the bar has height zero).
- The vertical axis contains either the scale of counts or the scale of percents. A histogram of
percents rather than counts is convenient when we want to compare several distributions.
- Each bar in the histogram represents a class. The base of the bar covers the class, and the bar
height is the class percent.
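Steps 1 to 3 can be sketched together in Python. The body lengths below are invented; only the range 9.4 to 22.8 feet and the seven classes come from the example above. The function name `histogram_table` is hypothetical. The text rendering is rotated, so the variable runs down the page and the bars grow to the right. A real histogram would put the variable on the horizontal axis.

```python
import math

def histogram_table(data, start, width, n_classes):
    """Count observations in classes (lower, upper] and convert counts to percents."""
    counts = [0] * n_classes
    for x in data:
        i = math.ceil((x - start) / width) - 1
        if not 0 <= i < n_classes:
            raise ValueError(f"{x} falls outside the chosen classes")
        counts[i] += 1
    # Step 2 check: the counts add up to the number of individuals.
    assert sum(counts) == len(data)
    percents = [100 * c / len(data) for c in counts]
    return counts, percents

# Hypothetical body lengths in feet (made up for this sketch).
lengths = [9.4, 10.8, 11.0, 12.5, 13.2, 13.9, 14.1, 14.6, 15.0, 15.3,
           15.8, 16.2, 16.4, 17.1, 17.7, 18.3, 19.0, 20.5, 22.8]

counts, percents = histogram_table(lengths, start=9.0, width=2.0, n_classes=7)

# Step 3, rotated: one row per class, one '#' per individual.
for i, (c, p) in enumerate(zip(counts, percents)):
    lower = 9.0 + 2.0 * i
    print(f"{lower:4.1f} < x <= {lower + 2.0:4.1f} | {'#' * c} ({p:.0f}%)")
```

An empty class would simply print a row with no `#`, which corresponds to a bar of height zero.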
Interpreting histograms
In any graph of data, look for the overall pattern and for striking deviations from that pattern. You
can describe the overall pattern of a histogram by its shape, centre and spread. An important kind of
deviation is an outlier, an individual value that falls outside the overall pattern.
Shape
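A distribution is unimodal if it has a single peak. A distribution is symmetric if its right and left
sides are approximately mirror images. In mathematics the two sides of a symmetric distribution are
exact mirror images, but real data are almost never exactly symmetric. A distribution is skewed to
the left if the left side extends much farther out than the right side, and skewed to the right if it is
the other way around.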
Outliers
Once you have spotted possible outliers, look for an explanation. Some outliers are due to mistakes,
such as typing errors. Other outliers point to the special nature of some observations. An outlier could also
simply be an unusual but perfectly legitimate observation.
Choice of classes in a histogram can influence the appearance of a distribution. When you describe a
distribution, concentrate on the main features. Look for major peaks, not for minor ups and downs in
the bars of the histogram. Look for clear outliers, not just for the smallest and largest observations.
Look for rough symmetry or clear skewness.
In some cases there are two clusters of individuals, which results in a bimodal distribution. We
can’t call such an irregular distribution either symmetric or skewed. Giving a single centre and spread
for it would be misleading, because the data suggest two distinct groups (for example, two age
groups). It would be better to describe the two groups separately.
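A small numerical sketch shows why a single centre misleads for a bimodal distribution. The ages are invented, and using the mean as the measure of centre is an assumption made here, since these notes do not define a numerical centre.

```python
from statistics import mean

# Hypothetical ages in years, made up to form two clusters.
young = [3, 4, 4, 5, 6]
older = [30, 32, 33, 35, 36]
ages = young + older

overall = mean(ages)
# Distance from the overall mean to the nearest actual observation.
nearest_gap = min(abs(a - overall) for a in ages)

print(f"overall mean = {overall:.1f}")
print(f"closest observation is {nearest_gap:.1f} years away from it")
print(f"group means  = {mean(young):.1f} and {mean(older):.1f}")
```

The overall mean falls in the gap between the clusters, where there are no observations at all, while the two group means each describe a typical member of their group.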
The overall shape of a distribution provides important information about a variable. Some variables
have distributions with predictable shapes. Many distributions have irregular shapes that are neither
symmetric nor skewed. Do not try to artificially manipulate the classes of a histogram so that the
data appear more symmetrical or more regular. Instead, accept that not all data have a distribution
that follows a neat pattern, even if you could obtain a larger data set. Use your eyes, describe what
you see, and then try to explain it.
Quantitative variables: dotplots
Dotplots are also commonly used to display the distribution of quantitative data, especially for small
data sets. They have the added advantage of displaying the raw data; that is, they show each one of
the values of the data set.
To make a dotplot:
- Sort the data set and plot each observation according to its numerical value along a labelled
scaled axis.
- Identical observations are typically stacked.
Like a histogram, the dotplot shows the shape, centre, and spread of the distribution as well as
potential outliers. However, because the dotplot is one-dimensional, the distribution’s shape and
centre are indicated by the density of dots rather than the height of the histogram bar. Dotplots with
neatly stacked dots are easy to draw by hand for reasonably small data sets.
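For a small data set of whole numbers, the two steps for making a dotplot can be sketched in text form. The data and the function name `dotplot` are made up for illustration, and the axis is drawn down the page with the dots stacked to the right.

```python
from collections import Counter

def dotplot(data):
    """Text dotplot for integer data: one row per axis value, one 'o' per observation."""
    counts = Counter(data)
    lines = []
    # Walk the whole scaled axis so that gaps in the data stay visible.
    for value in range(min(data), max(data) + 1):
        lines.append(f"{value:>3} | {'o' * counts[value]}")
    return lines

# Hypothetical observations (invented for this sketch).
observations = [4, 2, 3, 9, 4, 3, 5, 4]

for line in dotplot(observations):
    print(line)
```

Identical observations stack up in the same row, and the empty rows between 5 and 9 make the possible outlier at 9 easy to spot.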
Time plots
Many variables are measured at intervals over time. To display change over time, make a time plot. A
time plot of a variable plots each observation against the time at which it was measured. Always put
time on the horizontal scale of your plot and the variable you are measuring on the vertical scale.
Connecting the data points by lines helps emphasize any change over time. When you examine a
time plot, look once again for an overall pattern and for strong deviations from the pattern.
Histograms and time plots give different kinds of information about a variable. A time plot presents
time series data, for example showing the change in annual global temperature anomalies over time.
A histogram displays the distribution of the data, such as the temperature anomalies, regardless of
time.
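The contrast can be sketched with a tiny invented series. The values below are hypothetical measurements listed in time order, and the function name `time_plot` is made up. The text grid puts time on the horizontal axis (one column per measurement) and the variable on the vertical axis.

```python
from collections import Counter

def time_plot(values, height=5):
    """Text time plot: columns are time points in order, rows are value levels."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    levels = [round((v - lo) / span * (height - 1)) for v in values]
    rows = []
    for level in range(height - 1, -1, -1):  # highest level is printed first
        rows.append("".join("*" if lv == level else " " for lv in levels))
    return rows

# Hypothetical measurements in time order (invented for this sketch).
series = [1, 2, 2, 4, 5]

for row in time_plot(series):
    print("|" + row)

# The distribution ignores time: shuffling the series would not change it.
distribution = Counter(series)
print(sorted(distribution.items()))
```

The grid shows the upward trend over time, whereas the counts in `distribution` would be the same for any ordering of the same values.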
Chapter 1 summary
- A data set contains information on a number of individuals. Individuals may be people, animals,
or things. For each individual, the data gives values for one or more variables. A variable
describes some characteristic of an individual, such as a person’s height, sex, or age.
- Some variables are categorical, and others are quantitative. A categorical variable places each
individual into a category, such as male or female. A quantitative variable has numerical values
that measure some characteristic of each individual, such as height in centimeters or age in
years.
- Exploratory data analysis uses graphs and numerical summaries to describe the variables in a
data set and the relations among them.
- After you understand the background of your data (individuals, variables, unit of measurement),
the first thing to do almost always is plot your data.
- The distribution of a variable describes what values the variable takes and how often it takes
these values. Pie charts and bar graphs display the distribution of a categorical variable. Bar
graphs can also compare any set of quantities measured in the same units. Histograms and
dotplots display the distribution of a quantitative variable.
- When examining any graph, look for an overall pattern and for notable deviations from the
pattern.
- Shape, centre, and spread describe the overall pattern of the distribution of a quantitative
variable. Some distributions have simple shapes, such as symmetric or skewed. Not all
distributions have a simple overall shape. Describing the shape of a distribution when there are
few observations can be particularly challenging.
- Outliers are observations that lie outside the overall pattern of a distribution. Always look for
outliers and try to explain them.
- When observations on a variable are taken over time, make a time plot that graphs time
horizontally and the values of the variable vertically. A time plot can reveal trends, cycles, or
other changes over time.