STATISTICS SUMMARY
Chapter 1 – An introduction to Business Statistics and Analytics
1.1 Data
Any data set provides information about some group of individual elements, which may be people,
objects, events, or other entities. The information that a data set provides about its elements usually
describes one or more characteristics of these elements. Any characteristic of an element is called a
variable.
For any variable describing an element in a data set, we carry out a measurement to assign a value
of the variable to the element. Element: a person, object, or other entity about which we wish to
draw a conclusion. Variable: a characteristic of a population or sample element. Measurement: the
process of assigning a value of a variable to an element in a population or sample.
- Cross-sectional data: data collected at the same or approximately the same point in
time.
- Time series data: data collected over different time periods. Time series plot (runs plot):
a plot of time series data versus time.
1.2 Data Sources, Data Warehousing, and Big Data
Primary data: data collected by an individual or business directly through planned
experimentation or observation. Experimental study: a statistical study in which the analyst is able
to set or manipulate the values of the factors. Observational study: a statistical study in which the
analyst is not able to control the values of the factors.
Response variable: a variable of interest that we wish to study. Other variables, typically called
factors, that may be related to the response variable of interest will also be measured. Factor: a
variable that may be related to the response variable. When we are able to set or manipulate the
values of these factors, we have an experimental study.
1.3 Populations, Samples, and Traditional Statistics
A population is the set of all elements about which we wish to draw conclusions. We usually focus
on studying one or more variables describing the population elements. If we carry out a
measurement to assign a value of a variable to each and every population element, we have a
population of measurements (also called observations).
If we examine all of the population measurements, we say that we are conducting a census of the
population. Census: an examination of all the elements in a population. A sample is a subset of the
elements of a population. When we measure a characteristic of the elements in a sample, we have a
sample of measurements.
If the population is large and we need to select a sample from it, then we use statistical inference:
the science of using a sample of measurements to make generalizations about the important aspects
of a population of measurements. What we might call traditional statistics consists of a set of
concepts and techniques that are used to describe populations and samples and to make statistical
inferences about populations by using samples. Random sample: a sample selected in such a way
that every set of n elements in the population has the same chance of being selected. Two related
extensions of traditional statistics have been developed to help analyse big data:
- Business analytics: the use of traditional and newly developed statistical methods, advances
in information systems, and techniques from management science to continuously and
iteratively explore and investigate past business performance, with the purpose of gaining
insight and improving business planning and operations.
- Data mining: the process of discovering useful knowledge in extremely large data sets.
1.4 Random Sampling and Three Case Studies That Illustrate Statistical Inference
If the information contained in a sample is to accurately reflect the population under study, the
sample should be randomly selected from the population.
We call n the sample size.
1. If we select n elements from a population in such a way that every set of n elements in the
population has the same chance of being selected, then the n elements we select are said to
be a random sample.
2. In order to select a random sample of n elements from a population, we make n random
selections – one at a time – from the population. On each random selection, we give every
element remaining in the population for that selection the same chance of being chosen.
In making random selections from a population, we can sample with or without replacement.
- Sampling with replacement: we place the element chosen on any particular selection back
into the population. We give this element a chance to be chosen on any succeeding
selection.
- Sampling without replacement: we do not place the element chosen on a particular selection
back into the population. We do not give this element a chance to be chosen on any
succeeding selection. It is best to sample without replacement: because all of the elements
in the sample will be different, we will have the fullest possible look at the population.
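The two ways of sampling can be sketched in Python using the standard library. The population of element IDs, the sample size, and the seed below are illustrative assumptions, not values from the text.

```python
import random

# Hypothetical population: element IDs 1..20 (illustrative, not from the text).
population = list(range(1, 21))
n = 5  # sample size

rng = random.Random(42)  # fixed seed only so the sketch is reproducible

# Without replacement: a chosen element is not returned to the population,
# so all n sampled elements are different.
without_replacement = rng.sample(population, n)

# With replacement: a chosen element is returned to the population,
# so it may be chosen again on a later selection.
with_replacement = rng.choices(population, k=n)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

Repeated elements can only appear in the second sample, which is why sampling without replacement gives the fuller look at the population.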
Frame: a list of all the population elements. Infinite population: a population that is defined so that
there is no limit to the number of elements that could potentially belong to the population.
Random (or approximately random) sampling is a type of probability sampling. Probability
sampling: sampling where we know the chance (probability) that each population element will be
included in the sample. If we employ probability sampling, the sample obtained can be used to make
valid statistical inferences about the sampled population.
One type of sampling that is not probability sampling is convenience sampling, where we select
elements because they are easy or convenient to sample. Voluntary response sample: a sample in
which the participants self-select. Another type of sampling that is not probability sampling
is judgment sampling, where a person who is extremely knowledgeable about the population under
consideration selects population elements that he or she feels are most representative of the
population.
1.5 Business Analytics and Data Mining
Cluster analysis: involves finding natural groupings, or clusters, within data without having to
prespecify a set of categories. A financial analyst might use cluster detection to define different
groupings of stocks based on the past history of stock price fluctuations.
Factor analysis: involves starting with a large number of correlated variables and finding fewer
underlying, uncorrelated factors that describe the ‘essential aspects’ of the large number of
correlated variables.
Predictive analytics fall into two classes:
- Nonparametric predictive analytics: make predictions by using a relationship between the
response variable and the predictor variables that is not expressed in terms of a
mathematical equation involving parameters. These analytics include decision trees
(classification and regression trees), k-nearest neighbours, and naïve Bayes’ classification.
- Parametric predictive analytics: find a mathematical equation that relates the response
variable to the predictor variable(s) and involves unknown parameters that must be
estimated and evaluated by using sample data. These analytics include classical linear regression,
logistic regression, discriminant analysis, neural networks, and time series forecasting.
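As a minimal sketch of the difference between the two classes, the Python code below fits a least-squares line (parametric: the intercept b0 and slope b1 are parameters estimated from sample data) and makes a k-nearest-neighbours prediction (nonparametric: no equation with parameters is estimated). The data values, the function names, and the choice k = 2 are illustrative assumptions.

```python
# Illustrative (made-up) sample data: predictor x and response y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Parametric: estimate b0 and b1 of y = b0 + b1*x by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def knn_predict(xs, ys, x_new, k=2):
    """Nonparametric: average the responses of the k observations nearest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


b0, b1 = fit_line(xs, ys)
x_new = 3.4
print("parametric prediction:   ", b0 + b1 * x_new)
print("nonparametric prediction:", knn_predict(xs, ys, x_new))
```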
1.6 Ratio, Interval, Ordinal, and Nominative Scales of Measurement
Two types of quantitative variables:
- Ratio: a quantitative variable such that ratios of its values are meaningful and for which
there is an inherently defined zero value. Examples: salary, height, weight, time, distance.
- Interval: a quantitative variable such that ratios of its values are not meaningful and for
which there is not an inherently defined zero value. Example: temperature.
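A short Python sketch can show why ratios are not meaningful on an interval scale. It assumes temperature measured in degrees Celsius or Fahrenheit and uses distance as the ratio-scale contrast; the specific values are illustrative.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


# Interval scale: the ratio of two temperatures depends on the unit chosen,
# because neither scale has an inherently defined zero.
c_low, c_high = 10.0, 20.0
ratio_celsius = c_high / c_low
ratio_fahrenheit = celsius_to_fahrenheit(c_high) / celsius_to_fahrenheit(c_low)

# Ratio scale: the ratio of two distances is the same in any unit,
# because zero distance means the same thing in every unit.
KM_TO_MILES = 0.621371
km_short, km_long = 10.0, 20.0
ratio_km = km_long / km_short
ratio_miles = (km_long * KM_TO_MILES) / (km_short * KM_TO_MILES)

print("temperature ratios:", ratio_celsius, ratio_fahrenheit)
print("distance ratios:   ", ratio_km, ratio_miles)
```

The two temperature ratios disagree, so "twice as hot" has no fixed meaning, while the two distance ratios agree.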
Two types of qualitative variables:
- Ordinal: a qualitative variable for which there is a meaningful ordering or ranking of the
categories. Example: good, average, poor, unsatisfactory.
- Nominative: a qualitative variable for which there is no meaningful ordering, or ranking, of
the categories. Examples: gender, colour of car, state of residence.
1.7 Stratified Random, Cluster, and Systematic Sampling
Three sampling designs that are alternatives to random sampling:
- Stratified random sampling: A sampling design in which we divide a population into
nonoverlapping subpopulations and then select a random sample from each subpopulation
(stratum). Strata: the subpopulations in a stratified sampling design. Then a random sample
is selected from each stratum, and these samples are combined to form the full sample. It is
wise to stratify when the population consists of two or more groups that differ with respect
to the variable of interest.
- Cluster sampling: A sampling design in which we sequentially cluster population elements
into subpopulations.
- Systematic sampling: a sample taken by moving systematically through the population. We
might randomly select one of the first 200 population elements and then systematically
sample every 200th population element thereafter. In order to systematically select a sample
of n elements without replacement from a frame of N elements, we divide N by n and round
the result down to the nearest whole number. Calling the rounded result l, we then
randomly select one element from the first l elements in the frame – this is the first element
in the systematic sample. The remaining elements in the sample are obtained by selecting
every lth element following the first (randomly selected) element.
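The stratified and systematic designs above can be sketched in Python as follows. The function names, the frame of 1,000 element IDs, and the strata and their sample sizes are hypothetical choices made for illustration.

```python
import random


def systematic_sample(frame, n, rng):
    """Select n elements by taking every l-th element after a random start."""
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the frame size")
    l = N // n                # N divided by n, rounded down
    start = rng.randrange(l)  # random position among the first l elements (0-based)
    return [frame[start + i * l] for i in range(n)]


def stratified_sample(strata, sizes, rng):
    """Select a random sample from each stratum and combine the samples.

    strata: dict mapping stratum name -> list of elements
    sizes:  dict mapping stratum name -> sample size for that stratum
    """
    sample = []
    for name, elements in strata.items():
        sample.extend(rng.sample(elements, sizes[name]))
    return sample


rng = random.Random(7)  # fixed seed only so the sketch is reproducible

# Hypothetical frame of N = 1000 element IDs; n = 5 gives l = 200.
frame = list(range(1, 1001))
print(systematic_sample(frame, 5, rng))

# Hypothetical strata (two nonoverlapping groups) and per-stratum sample sizes.
strata = {"small firms": list(range(0, 60)), "large firms": list(range(100, 140))}
sizes = {"small firms": 6, "large firms": 4}
print(stratified_sample(strata, sizes, rng))
```

With N = 1000 and n = 5 the sketch reproduces the "every 200th element" example from the text.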
1.8 More about Surveys and Errors in Survey Sampling
The target population is the entire population of interest to us in a particular study. The sample
frame is a list of sampling elements (people or things) from which the sample will be selected. It
should closely agree with the target population.
Classes of survey errors:
- Errors of nonobservation: sampling error related to population elements that are not
observed.
- Errors of observation: sampling error that occurs when the data collected in a survey differ
from the truth. Can be caused by the data collector, the survey instrument, or the data
collection process.
Sampling error: the difference between a numerical descriptor of the population (a population
parameter) and the corresponding descriptor of the sample (a sample statistic). It occurs because
not all of the elements in the population have been measured. Undercoverage occurs when some
population elements are excluded from the process of selecting the sample.
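As a minimal illustration of sampling error, the Python sketch below compares a sample mean with the population mean. The population is simulated, and the sign convention (sample statistic minus population parameter) is an assumption made here for the example.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed only so the sketch is reproducible

# Hypothetical population of 1,000 measurements (simulated, not real data).
population = [rng.gauss(50, 10) for _ in range(1000)]
population_mean = statistics.mean(population)  # population parameter

# Random sample of n = 30 measurements, selected without replacement.
sample = rng.sample(population, 30)
sample_mean = statistics.mean(sample)          # sample statistic

# Sampling error: arises because only 30 of the 1,000 elements were measured.
sampling_error = sample_mean - population_mean

print("population mean:", round(population_mean, 2))
print("sample mean:    ", round(sample_mean, 2))
print("sampling error: ", round(sampling_error, 2))
```

A census would measure every element, so the same calculation would give a sampling error of zero.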
Chapter 2 – Descriptive Statistics and Analytics: Tabular and Graphical
Methods
2.1 Graphically Summarizing Qualitative Data
When we wish to summarize the proportion (or fraction) of items in each class, we employ the
relative frequency for each class. We obtain the percent frequency of a class by multiplying the
relative frequency by 100.
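A small Python sketch of these calculations, using made-up class labels as the qualitative data:

```python
from collections import Counter

# Hypothetical qualitative observations (class labels are illustrative).
observations = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

counts = Counter(observations)  # frequency of each class
n = len(observations)

# Relative frequency: the proportion of items in each class.
relative = {cls: count / n for cls, count in counts.items()}

# Percent frequency: the relative frequency multiplied by 100.
percent = {cls: 100 * rf for cls, rf in relative.items()}

for cls in sorted(counts):
    print(cls, counts[cls], relative[cls], percent[cls])
```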
Pareto charts: a bar chart of the frequencies or percentages for various types of defects. These are
used to identify opportunities for improvement.
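The data behind a Pareto chart can be sketched in Python as below. The defect types and counts are invented for illustration, and the sketch follows the usual convention of ordering the bars from most to least frequent and tracking a cumulative percentage.

```python
from collections import Counter

# Hypothetical defect records (types and counts are illustrative only).
defects = ["scratch"] * 6 + ["dent"] * 3 + ["misprint"] * 1

counts = Counter(defects)
total = sum(counts.values())

# Order defect types from most to least frequent and accumulate percentages.
rows = []
cumulative = 0
for defect, count in counts.most_common():
    cumulative += count
    rows.append((defect, count, 100 * cumulative / total))

# Text rendering of the chart: one bar of '#' characters per defect type.
for defect, count, cum_pct in rows:
    print(f"{defect:<10} {'#' * count:<8} {count:>3} {cum_pct:6.1f}%")
```

The tallest bars mark the defect types where improvement effort is likely to pay off most.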
2.2 Graphically Summarizing Quantitative Data
One rule for finding an appropriate number of classes says that the number of classes should be the
smallest whole number K that makes the quantity 2^K greater than the number of measurements in
the data set. To find the approximate length of each class, divide the range of the data by K:
class length = (largest measurement - smallest measurement) / K
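The rule for the number of classes can be sketched in Python as follows. The class length calculation assumes the common convention of dividing the range of the data (largest minus smallest measurement) by K; the function names and the data are hypothetical.

```python
def number_of_classes(n):
    """Smallest whole number K such that 2**K is greater than n."""
    K = 0
    while 2 ** K <= n:
        K += 1
    return K


def class_length(data):
    """Approximate class length: range of the data divided by K (assumed convention)."""
    K = number_of_classes(len(data))
    return (max(data) - min(data)) / K


# Hypothetical data set of 65 measurements: the values 1, 2, ..., 65.
data = list(range(1, 66))
print("K =", number_of_classes(len(data)))
print("class length =", class_length(data))
```

In practice the computed length is often rounded up to a convenient value so that the classes cover all of the measurements.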
When a histogram has a longer right tail than a left tail, the distribution is skewed to the right. When a
histogram has a longer left tail, the distribution is skewed to the left.
Frequency polygon: A graphical display in which we plot points representing each class frequency (or
relative frequency or percent frequency) above their corresponding class midpoints and connect the
points with line segments. Ogive: a graph of a cumulative distribution (frequencies, relative