Week 1: Introduction, Methodology Statistics; Crash-course in elementary statistics
Chapter 1: What is Statistics?
Statistics is a way to get information from data.
o Statistics practitioner: not merely a person who calculates baseball statistics, but a person
who uses statistical techniques properly; examples:
- Financial analyst who develops stock portfolios based on historical rates of return
- Economist who uses statistical models to help explain and predict variables such as inflation
- Market researcher who surveys consumers and converts the responses into useful
information
o Statistician; an individual educated in statistical principles, who works with the
mathematics of statistics. A statistician's work involves research that develops techniques and
concepts that in the future may help the statistics practitioner. Statisticians are also statistics
practitioners, frequently conducting empirical research and consulting.
Statistics is divided into two basic areas:
Descriptive statistics: deals with methods of organizing, summarizing, and presenting data in
a convenient and informative way. The actual technique we use depends on what specific
information we would like to extract;
o Graphical techniques that allow statistics practitioners to present data in ways that
make it easy for the reader to extract useful information. (chapters 2 & 3)
Histogram
o Numerical techniques to summarize data; calculate the average or mean. (chapter 4)
Measure of central location; average, median
Measure of variability; range (between the smallest and largest number)
Inferential statistics: a body of methods used to draw conclusions or inferences about
characteristics of populations based on sample data. A sample is only a small fraction of
the population
o Exit polls; a random sample of voters is asked how they voted
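The numerical descriptive measures named above (mean, median, range) can be sketched in a few lines of Python. The marks are made-up values for illustration only, not data from the course.

```python
import statistics

# Hypothetical sample of exam marks (illustrative values only).
marks = [55, 62, 68, 71, 74, 79, 83, 90, 93]

mean_mark = statistics.mean(marks)      # measure of central location
median_mark = statistics.median(marks)  # middle value of the sorted data
mark_range = max(marks) - min(marks)    # measure of variability: largest minus smallest

print(mean_mark, median_mark, mark_range)
```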
1.1 Key Statistical Concepts
Statistical inference problems involve three key concepts:
The population; group of all items of interest to a statistics practitioner. A population is a
very large group of people or things. A descriptive measure of a population is a parameter.
The parameter represents the information we need.
The sample; set of data drawn from the studied population. A descriptive measure of a
sample is a statistic. We use statistics to make inferences about parameters.
Statistical inference; the process of making an estimate, prediction, or decision about a
population based on sample data. Conclusions and estimations based on the sample of a
population are not always going to be correct. For this reason we build a measure of reliability
into the statistical inference. There are two such measures:
o Confidence level; the proportion of times an estimating procedure will be correct; e.g. with a
95% confidence level, estimates based on this form of statistical inference will be correct 95% of the time
o Significance level; measures how frequently a conclusion will be wrong; e.g. a 5% significance
level means that samples that lead to this conclusion will be wrong 5% of the time
Because populations are usually very large, investigating each member of the population would be
impractical and expensive. It is easier and cheaper to take a sample from the population of interest
and draw conclusions about the population on the basis of information provided by the sample.
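To make the link between parameter, statistic and confidence level concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the population is artificial, and the interval uses the common normal approximation (sample mean plus or minus 1.96 standard errors), which these notes have not introduced yet.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values; its mean is the parameter.
population = [random.gauss(170, 10) for _ in range(10_000)]
parameter = statistics.mean(population)

def interval_covers_parameter(sample_size=50):
    """Draw one random sample and check whether an approximate 95%
    interval around the sample mean (a statistic) contains the parameter."""
    sample = random.sample(population, sample_size)
    estimate = statistics.mean(sample)
    # Assumed normal-approximation margin: 1.96 standard errors.
    margin = 1.96 * statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate - margin <= parameter <= estimate + margin

trials = 1_000
hits = sum(interval_covers_parameter() for _ in range(trials))
coverage = hits / trials
print(f"Intervals containing the population mean: {coverage:.1%}")
```

The printed share should land close to 95%: most samples give a correct estimate, but a small fraction do not, which is what the confidence level quantifies.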
1.2 Statistical Applications in Business
1.3 Large Real Data Sets
Chapter 2: Graphical Descriptive Techniques 1
Present the principal methods that fall under the heading of descriptive statistics; Different types of
data in statistical applications
Graphical and tabular statistical methods allow managers to summarize data visually to produce
useful information that is often used in decision making.
Descriptive statistics: arranging, summarizing, and presenting a set of data in such a way that useful
information is produced. The method makes use of graphical techniques and numerical descriptive
measures (average) to summarize and present the data, allowing managers to make decisions based
on the information generated.
o The descriptive methods apply to both a set of data constituting a population and a set of
data constituting a sample
When to use each technique? The two most important factors that determine the appropriate
method to use are: the type of data & the information that is needed
2.1 Types of Data and Information
The objective of statistics is to extract information from data. There are different types of data and
information
Variable: some characteristic of a population or sample. The name of the variable is
represented by an uppercase letter (X, Y, Z). Example: the mark of a student
Values: the possible observations of the variable. Which values are possible? Example: the
possible marks, 0-100
Data: the observed values of a variable; this is the data from which we will extract the
information we seek. Example: the marks of 10 students. Datum: one value of the variable, e.g. the
mark of one student. There are three types of data:
o Interval data (ratio): quantitative or numerical; real numbers like height, weight,
incomes, and distances
o Nominal data: qualitative or categorical; words that describe categories. We often
record nominal data by arbitrarily assigning a number to each category. It doesn’t
matter in this case which numbers you take; as long as each category has a different
number assigned to it, the coding is valid.
o Ordinal data; appear to be nominal, but the order of their values has meaning; e.g. for
ratings, a higher value indicates a higher rating. When assigning codes to the values you
should maintain the order of the values. In this case it doesn’t matter which numbers
you use, as long as the order is maintained! Any code that preserves the order of the
data will produce exactly the same result.
The difference between interval and ordinal data is that differences between interval values are
consistent and meaningful; you can calculate the difference and interpret the result. You cannot
calculate and interpret differences for ordinal data, because the codes are arbitrary numbers that
only preserve the order.
Calculations for types of data
o Interval data; all calculations are permitted; e.g. calculating the average (chapter 4)
o Nominal data; because the codes are completely arbitrary, you cannot perform any
calculations on these codes. All that we are permitted to do with nominal data is count or
compute the percentages of the occurrences of each category.
o Ordinal data; the only permissible calculations are those involving a ranking process; median
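A short sketch of what these rules mean in practice: for nominal data we only count, and for ordinal data a ranking-based measure such as the median lands on the same category under any order-preserving coding. The categories, observations and both codings below are invented for illustration.

```python
import statistics
from collections import Counter

# Nominal data: the codes are arbitrary, so only counting is meaningful.
marital_status = ["single", "married", "married", "divorced", "single", "married"]
counts = Counter(marital_status)

# Ordinal data: two different (hypothetical) codings that both preserve the order.
ratings = ["poor", "good", "fair", "excellent", "good", "very good", "good"]
coding_a = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}
coding_b = {"poor": 6, "fair": 18, "good": 23, "very good": 45, "excellent": 88}

def median_category(data, coding):
    """Code the data, take the median code, and translate it back to a category."""
    decode = {code: category for category, code in coding.items()}
    codes = [coding[value] for value in data]
    # median_low always returns a value that occurs in the data.
    return decode[statistics.median_low(codes)]

median_a = median_category(ratings, coding_a)
median_b = median_category(ratings, coding_b)
print(counts["married"], median_a, median_b)
```

Averaging the codes, by contrast, would give different answers under the two codings, which is why the mean is not a valid calculation for ordinal data.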
Hierarchy of data
Interval data > Ordinal data > Nominal data
o Higher-level data types (interval) may be treated as lower-level ones (ordinal/nominal);
e.g. marks can be converted to letter grades or to pass/fail. Doing so loses information, so
don’t convert data unless it is necessary to do so.
o Lower-level data types cannot be treated as higher-level types
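The conversion from a higher-level to a lower-level data type can be sketched as follows. The grade cutoffs and the pass mark are assumptions chosen for the example, not values given in these notes.

```python
# Hypothetical exam marks (interval data).
marks = [42, 55, 67, 78, 91]

def to_letter_grade(mark):
    """Convert a mark to an ordinal letter grade (cutoffs are assumed)."""
    if mark >= 80:
        return "A"
    if mark >= 70:
        return "B"
    if mark >= 60:
        return "C"
    if mark >= 50:
        return "D"
    return "F"

def to_pass_fail(mark, pass_mark=50):
    """Convert a mark to a two-category outcome (pass mark is assumed)."""
    return "pass" if mark >= pass_mark else "fail"

grades = [to_letter_grade(mark) for mark in marks]
outcomes = [to_pass_fail(mark) for mark in marks]
print(grades)
print(outcomes)
```

Note the loss of information: marks of 55 and 78 both become "pass", and the original marks cannot be recovered from the grades.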
Interval | Ordinal | Nominal
Values are real numbers | Values represent the ranked order of the data | Values are arbitrary numbers that represent categories
All calculations are valid | Calculations based on an ordering process are valid | Only calculations based on the frequencies or percentages of occurrence are valid
Data may be treated as ordinal or nominal | Data may be treated as nominal, but not as interval | Data may not be treated as ordinal or interval
Arithmetic calculations
Variables whose observations constitute the data will be given the same name as the type of data;
e.g. interval data are the observations of an interval variable
2.2 Describing a Set of Nominal Data
Graphical and tabular techniques employed to describe a set of nominal data.
The only allowable calculation on nominal data is to count the frequency or compute the percentage
that each value of the variable represents. You can summarize the data in a table;
o Frequency distribution; presents the categories and their counts (bar chart)
o Relative frequency distribution; lists the categories and the proportion with which each
occurs (pie chart)
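Both tables can be built by counting. A minimal sketch, using an invented list of categories:

```python
from collections import Counter

# Hypothetical nominal data: one category per observation.
areas = ["accounting", "finance", "marketing", "finance",
         "accounting", "finance", "other", "marketing"]

# Frequency distribution: categories and their counts.
frequency = Counter(areas)

# Relative frequency distribution: each count divided by the number of observations.
total = sum(frequency.values())
relative_frequency = {category: count / total for category, count in frequency.items()}

for category, count in frequency.most_common():
    print(f"{category:<12}{count:>3}{relative_frequency[category]:>8.1%}")
```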
You can use graphical techniques to present a picture of the data, using one of the two graphical
methods below; a picture catches a reader’s eye more quickly than a table of numbers
o Bar chart; displays frequencies with a rectangle for each category; the height of the
rectangle represents the frequency
o Pie chart; shows relative frequencies in a circle subdivided into slices that represent the
categories. Each slice is proportional to the percentage corresponding to the category.
Because the entire circle is 360°, a category's percentage of the observations is represented
by a slice that spans that percentage of 360° (e.g. 25% corresponds to 90°)
The bar chart focuses on the frequencies and the pie chart focuses on the proportions. They are used
to grasp the substance of the data
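The slice sizes of a pie chart follow directly from the relative frequencies. A small sketch, with a made-up frequency distribution:

```python
# Hypothetical frequency distribution of a nominal variable.
frequency = {"accounting": 2, "finance": 3, "marketing": 2, "other": 1}
total = sum(frequency.values())

# Each slice spans (relative frequency) x 360 degrees.
slice_angles = {category: count / total * 360 for category, count in frequency.items()}

for category, angle in slice_angles.items():
    print(f"{category:<12}{frequency[category] / total:>7.1%}{angle:>8.1f} degrees")
```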
Ordinal data: there are no specific graphical techniques. When you want to describe a set of ordinal
data, you treat the data as if they were nominal. The only criterion is that
o in bar charts, the bars should be arranged in ascending or descending ordinal values
o in pie charts, the wedges are typically arranged clockwise in ascending or descending order
2.3 Describing the Relationship between two Nominal Variables and Comparing two or More
Nominal Data Sets
Univariate: techniques applied to single sets of data; frequency & relative frequency distribution
Bivariate: methods to depict the relationship between variables
o Cross-classification table (cross-tabulation table): used to describe the relationship
between two nominal variables.
Tabular method of describing the relationship between two nominal variables
1. Produce a cross-classification table that lists the frequency of each combination of the values
of the two variables.
2. To see if there is a relationship between two variables, you have to convert the frequencies
in each row or column to relative frequencies in each row or column; compute the row or
column total and divide each frequency by its row or column total.
3. Graphing the relationship between two nominal variables; three bar charts
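Steps 1 and 2 above can be sketched as follows. The two variables (occupation and newspaper), their categories and the observations are all hypothetical.

```python
from collections import Counter

# Hypothetical observations: one (occupation, newspaper) pair per respondent.
observations = [
    ("blue collar", "Post"), ("blue collar", "Post"), ("blue collar", "Star"),
    ("white collar", "Star"), ("white collar", "Star"), ("white collar", "Post"),
    ("white collar", "Star"),
]

# Step 1: frequency of each combination of values (the cross-classification table).
cell_counts = Counter(observations)
rows = sorted({row for row, _ in observations})
columns = sorted({column for _, column in observations})

# Step 2: divide each frequency by its row total to get row relative frequencies.
row_relative = {}
for row in rows:
    row_total = sum(cell_counts[(row, column)] for column in columns)
    row_relative[row] = {
        column: cell_counts[(row, column)] / row_total for column in columns
    }

for row in rows:
    print(row, {column: round(value, 2) for column, value in row_relative[row].items()})
```

If the rows show clearly different relative frequencies, as they do in this invented data, that suggests the two variables are related.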
Comparing two or more sets of nominal data
If the two variables are unrelated, the patterns exhibited in the bar charts should be approximately
the same, but if there is a relationship, then some bar charts will differ from others.
There are several ways to store the data to be used to produce a table or a bar or pie chart:
o The data are in two columns; the first column represents the categories of the first nominal
variable, and the second column stores the categories for the second variable. Each row
represents one observation of the two variables. The number of observations in each column
must be the same
o The data are stored in two or more columns, with each column representing the same
variable in a different sample or population.
o The table representing counts in a cross-classification table may have already been created
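As an illustration of the second storage layout (one column per sample), the sketch below computes a relative frequency distribution for each sample so that they can be compared. The sample names and answers are invented.

```python
from collections import Counter

# Hypothetical layout: each "column" holds the same nominal variable
# observed in a different sample; columns may differ in length.
samples = {
    "sample_1": ["yes", "no", "yes", "yes"],
    "sample_2": ["no", "no", "yes", "no", "no"],
}

def relative_frequencies(values):
    """Return the proportion of observations falling in each category."""
    counts = Counter(values)
    return {category: count / len(values) for category, count in counts.items()}

distributions = {name: relative_frequencies(values) for name, values in samples.items()}

for name, distribution in distributions.items():
    print(name, distribution)
```

Relative frequencies are used rather than raw counts because the samples may differ in size.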