Week 1
2.1 Variables and data
An observation is a single member of a collection of items that we want to study, such as a person,
firm, or region. An example of an observation is an employee or an invoice mailed last month. A
variable is a characteristic of the subject or individual, such as an employee’s income or an invoice
amount. The data set consists of all the values of all of the variables for all of the observations we
have chosen to observe. In this book, we will use data as a plural, and data set to refer to a collection
of observations taken as a whole. Data usually are entered into a spreadsheet or database as an n ×
m matrix, with one row for each of the n observations and one column for each of the m variables.
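As a minimal sketch of this layout, the Python snippet below stores a made-up data set in which each row is an observation and each column is a variable. The names and values are illustrative assumptions, not data from the text.

```python
# Hypothetical data set: each row is one observation (an employee),
# each column is one variable. All values are made up for illustration.
variables = ["name", "income"]
data_set = [
    ["Ana", 52000],
    ["Ben", 61000],
    ["Cy", 47500],
]

n = len(data_set)    # number of observations (rows)
m = len(variables)   # number of variables (columns)
print(f"{n} x {m} data set")  # 3 x 2 data set
```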
A data set may consist of many variables. The questions that can be explored and the analytical
techniques that can be used will depend upon the data type and the number of variables. This
textbook starts with univariate data sets (one variable), then moves to bivariate data sets (two
variables) and multivariate data sets (more than two variables).
A data set may contain a mixture of data types. Two broad categories are categorical data and
numerical data.
Categorical data (also called qualitative data) have values that are described by words rather than
numbers. For example, structural lumber can be classified by the lumber type (e.g., fir, hemlock, pine),
and automobile styles can be classified by size (e.g., full, midsize, compact, subcompact).
On occasion the values of the categorical variable might be represented using numbers. This is called
coding. Coding a category as a number does not make the data numerical and the numbers do not
typically imply a rank. But on occasion a ranking does exist. For example, a database might code
education degrees using numbers:
1 = Bachelor’s 2 = Master’s 3 = Doctorate
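The degree coding above can be sketched in Python as follows. The list of observations is made up for illustration, and the variable names are assumptions. The point is that the codes act as labels: counting them is meaningful, and comparing them is meaningful here only because this particular coding happens to be ranked.

```python
# Coding scheme from the degree example above.
degree_codes = {"Bachelor's": 1, "Master's": 2, "Doctorate": 3}

# Made-up observations for five employees.
degrees = ["Master's", "Bachelor's", "Doctorate", "Bachelor's", "Master's"]

coded = [degree_codes[d] for d in degrees]
print(coded)  # [2, 1, 3, 1, 2]

# Counting how often each category occurs is meaningful for coded data.
counts = {label: degrees.count(label) for label in degree_codes}

# Because these codes are ranked, the highest degree can be found by code.
highest = max(degrees, key=degree_codes.get)
print(highest)  # Doctorate
```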
Numerical data (also called quantitative data) arise from counting, measuring something, or some
kind of mathematical operation. For example, we could count the number of auto insurance claims
filed in March (e.g., 114 claims). Most accounting data, economic indicators, and financial ratios are
quantitative, as are physical measurements.
Numerical data can be broken down into two types. A variable with a countable number of distinct
values is discrete. Often, such data are integers. You can usually recognize integer data because their
description begins with “number of.” Examples are the number of Medicaid patients in a hospital
waiting room (e.g., 2) and the number of takeoffs at Chicago O’Hare International Airport in an hour
(e.g., 37). Such data are integer variables because we cannot observe a fractional number of patients
or takeoffs.
A numerical variable that can have any value within an interval is continuous (e.g., 427.21 grams).
Sometimes we round a continuous measurement to an integer (e.g., 427 grams), but that does not
make the data discrete.
If each observation in the sample represents a different equally spaced point in time (years, months,
days), we have time series data. The periodicity is the time between observations. It may be annual,
quarterly, monthly, weekly, daily, hourly, etc.
2.2 Level of measurement
Data types shown in Figure 2.1 can be further classified by their measurement level. Statisticians
typically refer to four levels of measurement for data: nominal, ordinal, interval, and ratio.
- Nominal measurement is the weakest level of measurement and the easiest to recognize.
Nominal data merely identify a category. “Nominal” data are the same as “categorical” or
“classification” data. For example: Did you file an insurance claim last month? 1. Yes 2. No
- Ordinal data codes connote a ranking of data values. For example: How often do you use
Microsoft Access?
1. Frequently 2. Sometimes 3. Rarely 4. Never
Like nominal data, these ordinal numerical codes lack the properties that are required to
compute many statistics, such as the average. Specifically, there is no clear meaning to the
distance between 1 and 2, or between 2 and 3, or between 3 and 4 (what would be the
distance between “Rarely” and “Never”?).
- The next step up the measurement scale is interval data. Interval data are used frequently
and are important in business. Interval data often arise from surveys where customers are
asked to rate their satisfaction with a service or product on a numerical scale. While these
scale points are expressed as numbers (e.g., 1–10), the scale is arbitrary, the numbers are
not a count or a physical measure, and the value “0” has no meaning. However, we can say
that the distances between scale points have meaning. The difference between a rating of 4
and 6 is treated the same as the difference between 7 and 9. Because intervals between
numbers represent distances, we can do mathematical operations such as taking an average.
But because the zero point of these scales is arbitrary, we can’t say that a customer who
rates our service an 8 is twice as satisfied as a customer who rates our service a 4. That is,
ratios are not meaningful for interval data. The absence of a meaningful zero is a key
characteristic of interval data.
The Likert scale is a special case that is frequently used in survey research. You have
undoubtedly seen such scales. Typically, a statement is made and the respondent is asked
to indicate his or her agreement/disagreement on a five-point or seven-point scale using
verbal anchors.
- Ratio measurement is the strongest level of measurement. Ratio data have all the properties
of the other three data types, but in addition possess a meaningful zero that represents the
absence of the quantity being measured. Because of the zero point, ratios of data values are
meaningful (e.g., $20 million in profit is twice as much as $10 million). Balance sheet data,
income statement data, financial ratios, physical counts, scientific measurements, and most
engineering measurements are ratio data because zero has meaning (e.g., a company with
zero sales sold nothing). Having a zero point does not restrict us to positive data. For
example, profit is a ratio variable (e.g., $4 million is twice $2 million), yet firms can have
negative profit (i.e., a loss).
Lack of a true zero is often the quickest test to defrock variables masquerading as ratio data.
For example, a Likert scale (+2, +1, 0, −1, −2) is not ratio data despite the presence of zero
because the zero (neutral) point does not connote the absence of anything.
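One way to see why ratios fail for interval data is to relabel an arbitrary rating scale and check what survives. The sketch below uses two made-up satisfaction ratings on a 1–10 scale and shifts the scale to 0–9. The variable names are assumptions for illustration.

```python
# Two made-up satisfaction ratings on an arbitrary 1-10 scale.
a, b = 8, 4

# Relabel the same scale as 0-9 by shifting every point down by one.
a_shifted, b_shifted = a - 1, b - 1

# Differences survive the relabeling, so distances are meaningful.
print(a - b, a_shifted - b_shifted)  # 4 4

# Ratios do not survive it, so "twice as satisfied" is not meaningful.
print(a / b, round(a_shifted / b_shifted, 2))  # 2.0 2.33
```

For ratio data such as profit, the zero point is fixed by the quantity itself, so no such relabeling is available and ratios stay meaningful.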
2.3 Sampling concepts
A sample involves looking only at some items selected from the population, while a census is an
examination of all items in a defined population. The accuracy of a census can be illusory. For
example, the 2000 U.S. decennial census is believed to have overcounted by 1.3 million people while
the 2010 census count is thought to have overestimated the U.S. population by only 36,000.
When the quantity being measured is volatile, there cannot be a census. For example, The Arbitron
Company tracks American radio listening habits using over 2.6 million “Radio Diary Packages.” For
each “listening occasion,” participants note start and stop times for each station.
From a sample of n items, chosen from a population, we compute statistics that can be used as
estimates of parameters found in the population. To avoid confusion, we use different symbols for
each parameter and its corresponding statistic. Thus, the population mean is denoted μ (the
lowercase Greek letter mu) while the sample mean is x̄. The population proportion is denoted π, while
the sample proportion is p.
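The distinction between parameters and statistics can be sketched in Python. The population below is simulated (1,000 made-up invoice amounts), and the 250 cutoff, the sample size, and the variable names are assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 made-up invoice amounts.
population = [round(random.uniform(10, 500), 2) for _ in range(1000)]

# Parameters describe the whole population.
mu = mean(population)                                     # population mean
pi = sum(x > 250 for x in population) / len(population)   # proportion above 250

# Statistics are computed from a sample of n items and estimate the parameters.
n = 50
sample = random.sample(population, n)
x_bar = mean(sample)                   # sample mean, estimates mu
p = sum(x > 250 for x in sample) / n   # sample proportion, estimates pi

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
print(f"pi = {pi:.3f}, p = {p:.3f}")
```

In practice the parameters are unknown, which is why the sample statistics are needed. They are computed here only because the population is simulated.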
A population may be defined either by a list (e.g., the names of the passengers on Flight 234) or by a
rule (e.g., the customers who eat at Noodles & Company). The target population contains all the
individuals in which we are interested. The sampling frame is the group from which we take the
sample. If the frame differs from the target population, then our estimates might not be accurate.
Examples of frames are phone directories or marketing databases.
3.1 Stem-and-leaf displays
Statistics offers methods that can help organise, explore, and summarise data in a succinct way. The
methods may be visual (charts and graphs) or numerical (statistics or tables). In this chapter, you will
see how visual displays can provide insight into the characteristics of a data set without using
mathematics. The type of graph you use to display your data is dependent on the type of data you
have. Some charts are better suited for quantitative data, while others are better for displaying
categorical data.
Before calculating any statistics or drawing any graphs, it is a good idea to look at the data and try to
visualise how they were collected.
As a first step, it is helpful to sort the data. From the sorted data, we can see the range, the frequency
of occurrence for each data value, and the data values that lie near the middle and ends.
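A minimal Python sketch of this first step, using a made-up sample of 11 values (the data and variable names are assumptions for illustration):

```python
from collections import Counter

# Made-up sample of 11 values.
data = [27, 31, 24, 31, 40, 27, 31, 22, 35, 27, 29]

sorted_data = sorted(data)
smallest, largest = sorted_data[0], sorted_data[-1]
data_range = largest - smallest              # distance between the ends
frequencies = Counter(sorted_data)           # how often each value occurs
middle = sorted_data[len(sorted_data) // 2]  # middle value (n is odd here)

print(sorted_data)           # [22, 24, 27, 27, 27, 29, 31, 31, 31, 35, 40]
print(smallest, largest)     # 22 40
print(data_range)            # 18
print(frequencies[27])       # 3
print(middle)                # 29
```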
When the number of observations is large, a sorted list of data values is difficult to analyse. Further, a
list of numbers may not reveal very much about centre, variability, and shape. To see
broader patterns in the data, analysts often prefer a visual display of the data.
One simple way to visualise small data sets is a stem-and-leaf plot. The
stem-and-leaf plot is a tool that seeks to reveal essential data features in an intuitive
way.
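A stem-and-leaf plot can be sketched in a few lines of Python. The function name `stem_and_leaf` and the sample values are assumptions for illustration, and the sketch assumes non-negative integers with the tens as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot for non-negative integers,
    using the tens as the stem and the units digit as the leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Walk every stem from smallest to largest so empty stems still appear.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = " ".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {digits}".rstrip())
    return lines

# Made-up sample.
for line in stem_and_leaf([22, 24, 27, 27, 29, 31, 31, 35, 40, 58]):
    print(line)
#  2 | 2 4 7 7 9
#  3 | 1 1 5
#  4 | 0
#  5 | 8
```

Keeping the empty stems preserves the spacing of the data, so gaps remain visible in the display.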
A dot plot is another simple graphical display of n individual values of numerical data.
If more than one data value lies at approximately the same X-axis location, the dots
are piled up vertically.
A stacked dot plot can be used to compare two or more groups. For example, the
figure shows a stacked dot plot for median home prices for 150 U.S. cities in four
different regions. The same X-axis scale is used for all four dot plots.
3.2 Frequency Distributions and Histograms
A frequency distribution is a table formed by classifying n data values into
k classes called bins. The bin limits define the values to be included in
each bin. Usually, all the bin widths are the same. The table shows the
frequency of data values within each bin. Frequencies also can be
expressed as relative frequencies or percentages of the total number of
observations. The steps of making a frequency distribution are: (1) find the
smallest and largest data values, (2) choose the number of bins, (3) set
bin limits, (4) count the data values in each bin, and (5) prepare a table.
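The five steps can be sketched in Python as follows. The function name `frequency_distribution` and the sample values are assumptions for illustration. The sketch sets equal-width bin limits running exactly from the smallest to the largest value, whereas in practice bin limits are usually rounded to convenient values. It also assumes the values are not all identical.

```python
def frequency_distribution(values, k):
    """Classify values into k equal-width bins.

    Returns one (lower limit, upper limit, frequency) row per bin.
    Assumes the values are not all identical.
    """
    lo, hi = min(values), max(values)   # step 1: smallest and largest values
    width = (hi - lo) / k               # steps 2-3: k bins of equal width
    counts = [0] * k
    for v in values:                    # step 4: count the values in each bin
        # The largest value falls on the last upper limit; keep it in the last bin.
        i = min(int((v - lo) / width), k - 1)
        counts[i] += 1
    # Step 5: prepare the table.
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(k)]

# Made-up sample of 10 values, classified into 4 bins.
data = [22, 24, 27, 27, 29, 31, 31, 35, 40, 58]
table = frequency_distribution(data, 4)
for lower, upper, count in table:
    relative = count / len(data)        # relative frequency
    print(f"{lower:5.1f} to {upper:5.1f}: {count:2d}  ({relative:.0%})")
```

Each bin here includes its lower limit and excludes its upper limit, except the last bin, which includes both.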
A histogram is a graphical representation of a frequency distribution. A
histogram is a column chart whose Y-axis shows the number of data values
(or a percentage) within each bin of a frequency distribution and whose
X-axis ticks show the end points of each bin. There should be no gaps
between bars (except when there are no data in a particular bin). Histograms can have multiple shapes
as shown in the figure.
A frequency polygon is a line graph that connects the midpoints of the histogram bin intervals, plus
extra intervals at the beginning and end so that the line will touch the X-axis. It serves the same
purpose as a histogram but is attractive when you need to compare two data sets (because more than
one frequency polygon can be plotted on the same scale). An ogive is a line graph of the cumulative
frequencies. It is useful for finding percentiles or for
comparing the shape of the sample with a known benchmark
such as the normal distribution.
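The cumulative frequencies plotted in an ogive can be computed with a running total. The bin frequencies below are made up for illustration, and the variable names are assumptions.

```python
from itertools import accumulate

# Made-up bin frequencies from a frequency distribution with four bins.
bin_counts = [5, 3, 1, 1]

cumulative = list(accumulate(bin_counts))           # running total of frequencies
n = cumulative[-1]                                  # total number of observations
cumulative_pct = [100 * c / n for c in cumulative]  # cumulative percentages

print(cumulative)      # [5, 8, 9, 10]
print(cumulative_pct)  # [50.0, 80.0, 90.0, 100.0]
```

Reading the cumulative percentages against the upper bin limits gives approximate percentiles. Here, for example, 80 percent of the observations lie at or below the upper limit of the second bin.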
3.4 Line Charts
A line chart like the one shown in Figure 3.11 is used to display a time series, to spot
trends, or to compare time periods. Line charts can be used to display several variables at
once. If two variables are displayed, the right and left scales can differ, using the right scale for one
variable and the left scale for the other.
3.5 Column and Bar Charts
A column chart is a vertical display of data and a bar chart is a horizontal
display of data. The column chart is probably the most common type of
data display in business. Attribute data are displayed using a column to
represent a category or attribute. The height of each column reflects a
frequency or a value for that category. Each column has a label showing
the category name.
A special type of column chart used in business is the Pareto chart. A Pareto chart
displays categorical data, with categories displayed in descending order of frequency,
so that the most common categories appear first. Typically, only a few categories
account for the majority of observations.
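The ordering behind a Pareto chart can be sketched in Python. The complaint categories are made up for illustration, and the cumulative percentage column is an addition to the description above, included to show how few categories account for most observations.

```python
from collections import Counter

# Made-up complaint categories for illustration.
complaints = ["late", "damaged", "late", "billing",
              "late", "damaged", "late", "other"]

# Categories in descending order of frequency, as in a Pareto chart.
ordered = Counter(complaints).most_common()
total = len(complaints)

pareto = []
running = 0
for category, count in ordered:
    running += count
    pareto.append((category, count, round(100 * running / total, 1)))

for category, count, cumulative_pct in pareto:
    print(f"{category:<8} {count}  {cumulative_pct}%")
# late     4  50.0%
# damaged  2  75.0%
# billing  1  87.5%
# other    1  100.0%
```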
3.6 Pie Charts
Many statisticians feel that a table or bar chart is often a better choice than a pie chart. But, because
of their visual appeal, pie charts appear daily in company annual reports and the popular press (e.g.,
USA Today, The Wall Street Journal, Scientific American), so you must understand their uses and
misuses. A pie chart can only convey a general idea of the data because it is hard to assess areas
precisely. It should have only a few slices (typically two to five) and the slices should be labelled with
data values or percentages. The only correct use of a pie chart is to portray data that sum to a total.
3.7 Scatter Plots
A scatter plot shows n pairs of observations (x1, y1), (x2, y2), . . ., (xn, yn) as dots (or some other
symbol) on an X-Y graph. This type of display is so important in statistics that it
deserves careful attention. A scatter plot is a starting point for bivariate data
analysis. We create scatter plots to investigate the relationship between two
variables. Typically, we would like to know if there is an association between
two variables and, if so, what kind of association exists.