Chapter 1 – Statistics, Data and Statistical Thinking
1.3 Fundamental Elements of Statistics
Statistical methods are particularly useful for studying, analysing, and learning about populations of
experimental units.
An experimental unit is an object (e.g., person, thing, transaction, or event) about which we collect
data.
A population is a set of all units (usually people, objects, transactions, or events) that we are interested
in studying. Example: all working people in the USA.
In studying a population, we focus on one or more characteristics or properties of the units in the
population. We call such characteristics variables.
A variable is a characteristic or property of an individual experimental (or observation) unit in the
population. Examples: age, gender, years of education.
Measurement is the process we use to assign numbers to variables of individual population units.
A census is the measurement of a variable for every unit of a population.
A sample is a subset of the units of a population.
1.4 Types of Data
All data can be classified as one of two general types:
- Quantitative data are measurements that are recorded on a naturally occurring numerical scale.
o The temperature at which each piece in a sample begins to melt
o The current unemployment rates
o The number of convicted murderers who receive the death penalty
- Qualitative data are measurements that cannot be measured on a natural numerical scale; they
can only be classified into one of a group of categories.
o A taste tester’s ranking of four brands of sauce (best, worst)
o The political party affiliation in a sample of 50 voters (Democrat, Republican)
o Closed at night (yes or no)
Chapter 2 – Methods of Describing Sets of Data
2.1 Describing Qualitative Data
Class is one of the categories into which qualitative data can be classified.
Class frequency is the number of observations in the data set that fall into a particular class.
The class relative frequency is the class frequency divided by the total number of observations in the
data set. (Class frequency / n)
The class percentage is the class relative frequency × 100. ((Class frequency / n) × 100)
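The three definitions above can be sketched in a few lines of Python. The voter data below is made up for illustration, and the names `observations` and `summary` are assumptions, not part of the original notes.

```python
from collections import Counter

# Hypothetical sample: party affiliation of 10 voters (made-up values).
observations = ["Democrat", "Republican", "Democrat", "Independent",
                "Republican", "Democrat", "Democrat", "Republican",
                "Independent", "Democrat"]

n = len(observations)                    # total number of observations
class_frequency = Counter(observations)  # number of observations per class

summary = {}
for cls, freq in class_frequency.items():
    relative = freq / n                  # class relative frequency
    summary[cls] = (freq, relative, relative * 100)  # last entry: class percentage

for cls, (freq, rel, pct) in summary.items():
    print(f"{cls:<12} frequency={freq}  relative={rel:.2f}  percentage={pct:.0f}%")
```

The relative frequencies always sum to 1 (and the percentages to 100), which is a quick sanity check on any frequency table.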
Graphical Descriptive Methods for Qualitative Data
- Bar graph: the categories of the qualitative variable are represented by bars, where the height
of each bar is either the class frequency, class relative frequency or class percentage
- Pie Chart: the categories (classes) of the qualitative variable are represented by slices of a
pie. The size of each slice is proportional to the class relative frequency.
- Pareto Diagram: A bar graph with the categories (classes) of the qualitative variable
arranged by height in descending order from left to right.
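As a rough illustration of the Pareto ordering described above, the sketch below sorts classes by frequency and prints a text-only bar chart. The complaint categories are made-up data, and the name `pareto_order` is an assumption used only for this example.

```python
from collections import Counter

# Hypothetical complaint categories (made-up data).
complaints = ["late", "damaged", "late", "wrong item", "late",
              "damaged", "late", "wrong item", "late", "damaged"]

# most_common() returns (class, frequency) pairs with the highest frequency
# first, which matches the left-to-right bar order of a Pareto diagram.
pareto_order = Counter(complaints).most_common()

for cls, freq in pareto_order:
    print(f"{cls:<11} {'#' * freq} ({freq})")
```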
2.2 Graphical Methods for Describing Quantitative Data
To describe, summarize and detect patterns in such data, we can use three graphical methods:
1. Dot plots: The numerical value of each measurement in the data set is
located on the horizontal scale by a dot. When data values repeat, the dots are
placed above one another.
2. Stem-and-leaf display: The stem is the portion of the measurement to the
left of the decimal point, while the remaining portion, to the right of the decimal point,
is the leaf. The stems for the data set are listed in a column. Then the leaf for
each observation is listed to the right of its stem.
3. Histograms: The possible numerical values of the quantitative variable are
partitioned into class intervals, each of which has the same width. These intervals form
the scale of the horizontal axis. The frequency or relative frequency of observations in
each class interval is determined. A vertical bar is placed over each class interval, with
the height of the bar equal to either the class frequency or class relative frequency.
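The stem-and-leaf and histogram constructions can be sketched as follows. This is a minimal illustration under stated assumptions: the data is made up, every value has one decimal place, and the histogram's starting point, class width, and number of classes (`lower`, `width`, `k`) are chosen by hand.

```python
from collections import defaultdict

# Hypothetical measurements with one decimal place (made-up data).
data = [3.1, 2.4, 3.7, 4.0, 2.9, 3.3, 3.1, 4.6, 2.2, 3.8]

# Stem-and-leaf: stem = digits left of the decimal point,
# leaf = the digit right of the decimal point.
stems = defaultdict(list)
for x in sorted(data):
    tenths = round(x * 10)           # e.g. 3.7 -> 37, avoids float truncation
    stem, leaf = divmod(tenths, 10)
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))

# Histogram: k equal-width class intervals starting at `lower` (assumed choices).
lower, width, k = 2.0, 1.0, 3
frequency = [0] * k
for x in data:
    # Clamp the index so a value on the top edge lands in the last class.
    index = min(int((x - lower) / width), k - 1)
    frequency[index] += 1
relative_frequency = [f / len(data) for f in frequency]

print("class frequencies:", frequency)
print("class relative frequencies:", relative_frequency)
```

The bar heights of the histogram would be either `frequency` or `relative_frequency`; both give the same shape.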
2.3 Numerical Measures of Central Tendency
The mean of a set of quantitative data is the sum of the measurements, divided by the number of
measurements contained in the data set. We denote the mean of a sample of measurements by x̄. For
the mean of a population, we use a different symbol: µ
The median of a quantitative data set is the middle number when the measurements are arranged in
ascending (or descending) order. We denote the median of a sample of measurements by M. For the
median of a population, we use a different symbol: η
A data set is said to be skewed if one tail of the distribution has
more extreme observations than the other tail.
The mode is the measurement that occurs most frequently in the
data set.
The measurement class containing the largest relative frequency is
called the modal class.
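A short sketch of the three measures of central tendency, using Python's standard `statistics` module on a made-up sample:

```python
import statistics

# Hypothetical sample of 7 measurements (made-up data).
sample = [5, 3, 8, 3, 9, 4, 3]

mean = statistics.mean(sample)      # sum of measurements / number of measurements
median = statistics.median(sample)  # middle value once the data are sorted
mode = statistics.mode(sample)      # measurement that occurs most frequently

print("mean =", mean, " median =", median, " mode =", mode)
```

In this sample the mean is larger than the median, which is a common sign that the data are skewed to the right: a few large values pull the mean upward.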
2.5 Numerical Measures of Variability
Range is equal to the largest measurement – the smallest
measurement.
A deviation is the difference between a measurement and the mean.
The sample variance for a sample of n measurements is equal to the sum of the squared deviations
from the mean divided by (n − 1). The symbol s² is used to represent the sample variance.
The sample standard deviation, s, is defined as the positive square
root of the sample variance: s = √s²
s² = sample variance
s = sample standard deviation
σ = population standard deviation
σ² = population variance
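The measures of variability can be worked through step by step. The sketch below uses a made-up sample and the usual n − 1 divisor for the sample variance; the variable names are assumptions for illustration.

```python
import math
import statistics

# Hypothetical sample of 7 measurements (made-up data).
sample = [5, 3, 8, 3, 9, 4, 3]
n = len(sample)
mean = sum(sample) / n

data_range = max(sample) - min(sample)        # largest minus smallest
deviations = [x - mean for x in sample]       # each measurement minus the mean
sum_of_squares = sum(d ** 2 for d in deviations)

s2 = sum_of_squares / (n - 1)                 # sample variance
s = math.sqrt(s2)                             # sample standard deviation

print("range =", data_range)
print("s^2 =", round(s2, 3), " s =", round(s, 3))

# The standard library uses the same n - 1 divisor for sample statistics.
assert math.isclose(s2, statistics.variance(sample))
assert math.isclose(s, statistics.stdev(sample))
```

The deviations always sum to zero, which is why they are squared before being averaged.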
Chebyshev's rule applies to any data set, regardless of the shape of the frequency distribution of the
data.
The empirical rule is a rule of thumb that applies to data sets with frequency distributions that are
mound-shaped and symmetric.
Interval            Chebyshev's rule    Empirical rule
(µ − σ, µ + σ)      At least 0%         ≈ 68%
(µ − 2σ, µ + 2σ)    At least 75%        ≈ 95%
(µ − 3σ, µ + 3σ)    At least 89%        ≈ All
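The Chebyshev column follows from the bound 1 − 1/k² for the interval within k standard deviations of the mean (0% for k = 1, 75% for k = 2, about 89% for k = 3). The sketch below compares that bound with the observed fraction in a made-up sample; it uses the sample mean and sample standard deviation in place of µ and σ, and the function names are assumptions for illustration.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    return max(0.0, 1 - 1 / k ** 2)

def fraction_within(data, k):
    """Observed fraction of measurements strictly inside (mean - k*s, mean + k*s)."""
    center = statistics.mean(data)
    spread = statistics.stdev(data)
    inside = sum(center - k * spread < x < center + k * spread for x in data)
    return inside / len(data)

# Hypothetical data with one extreme value (made-up, not mound-shaped).
data = [12, 15, 11, 14, 13, 16, 12, 14, 13, 15, 12, 14, 13, 11, 40]

for k in (1, 2, 3):
    print(f"k={k}: observed {fraction_within(data, k):.2f}, "
          f"Chebyshev guarantees at least {chebyshev_lower_bound(k):.2f}")
```

Chebyshev's bound holds even for skewed data like this, whereas the empirical rule percentages should only be expected for mound-shaped, symmetric distributions.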
2.6 Numerical Measures of Relative Standing
The pth percentile is a number such that p% of the measurements fall below that number and (100 − p)% fall
above it.
Percentiles that partition a data set into four categories, each category containing exactly 25% of the
measurements are called quartiles.
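The lower quartile (Q_L) is the 25th percentile: 25% of the data fall below it.
The middle quartile (M) is the median, or 50th percentile: 50% of the data fall below it.
The upper quartile (Q_U) is the 75th percentile: 75% of the data fall below it.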
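Quartiles can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The data below is made up, and the helper `percent_below` is a hypothetical name used only in this sketch. Note that several conventions exist for locating quartiles in small data sets, so other tools or textbooks may give slightly different values.

```python
import statistics

# Hypothetical data set of 11 measurements (made-up values).
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 18]

# quantiles(..., n=4) returns the three cut points that split the data into
# quarters. Its default "exclusive" method is only one of several conventions.
q_l, m, q_u = statistics.quantiles(data, n=4)

def percent_below(values, number):
    """Percentage of measurements that fall strictly below `number`."""
    return 100 * sum(v < number for v in values) / len(values)

print("Q_L =", q_l, " M =", m, " Q_U =", q_u)
print("percent of data below Q_U:", round(percent_below(data, q_u), 1))
```

With only 11 measurements the percentage below Q_U cannot be exactly 75%; the percentile definition is approximate for small data sets.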
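A z-score gives the relative location of a measurement: its distance from the mean, expressed in
units of the standard deviation. For a sample, z = (x − x̄) / s.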
Interpretation of z-scores for mound-shaped distributions of data
1. Approximately 68% of the measurements will have a z-score between -1 and 1
2. Approximately 95% of the measurements will have a z-score between -2 and 2
3. Approximately 99.7% of the measurements will have a z-score between -3 and 3
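A minimal sketch of computing sample z-scores, assuming the usual definition z = (x − x̄) / s with the sample standard deviation. The data is made up and the function name `z_score` is an assumption for illustration.

```python
import statistics

# Hypothetical sample of 7 measurements (made-up data).
sample = [5, 3, 8, 3, 9, 4, 3]
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation (n - 1 divisor)

def z_score(x):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - x_bar) / s

z_scores = [z_score(x) for x in sample]
for x, z in zip(sample, z_scores):
    print(f"x = {x}: z = {z:+.2f}")
```

A measurement equal to the mean has z = 0, and a negative z-score means the measurement lies below the mean.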
2.7 Methods for Detecting Outliers: Box plots and z-scores
An outlier is an observation that is unusually large or small relative to the other values in a data set.
Outliers typically are attributable to one of the following causes:
1. The measurement is observed, recorded, or entered into the computer incorrectly
2. The measurement comes from a different population
3. The measurement is correct but represents a rare event
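The two detection methods named in the section title can be sketched as below. The thresholds are assumptions, since the notes do not state them: a z-score larger than 3 in absolute value, and box plot fences placed 1.5 interquartile ranges beyond the quartiles, are common conventions. The data is made up, and the quartile method is the default of `statistics.quantiles`.

```python
import statistics

# Hypothetical data with one suspicious value (made-up).
data = [12, 15, 11, 14, 13, 16, 12, 14, 13, 15, 12, 14, 13, 11, 40]

# z-score method: flag measurements more than 3 standard deviations from the mean
# (the cutoff of 3 is an assumed convention).
x_bar = statistics.mean(data)
s = statistics.stdev(data)
z_outliers = [x for x in data if abs((x - x_bar) / s) > 3]

# Box plot method: flag measurements beyond fences 1.5 * IQR from the quartiles
# (the multiplier 1.5 is an assumed convention).
q_l, _, q_u = statistics.quantiles(data, n=4)
iqr = q_u - q_l
lower_fence = q_l - 1.5 * iqr
upper_fence = q_u + 1.5 * iqr
fence_outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("z-score outliers:", z_outliers)
print("box plot outliers:", fence_outliers)
```

Flagging a value is only the first step; deciding which of the three causes above applies still requires checking how the measurement was obtained.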