Statistics module 1: Data and decisions
1. What are data?
Statistics is about data and decisions, and about quantities calculated from data. It is a toolbox and a way of thinking -> its purpose is to gather data that is relevant to the problem.
Statistics: a collection of tools and the associated reasoning to model, summarize and understand data.
Data: not just information, but a set of values that are measured or observed, along with their context.
The “Five W’s” (and How)
Who: the subjects or cases the data describe -> Who and What make up the data themselves and are essential.
What: the variables recorded for each case.
Where: where the values were recorded -> Why, When, Where and How are part of the metadata.
When: when the values were recorded.
How: how the values were recorded.
Why: why the values were recorded.
Big data: Data sets so large that traditional methods of storage and analysis are inadequate.
Data mining -> companies try to obtain actionable information from data that may have been collected in the course of doing business.
Predictive analytics -> focuses on predicting future performance.
Business analytics -> any use of data and statistical analysis to inform business decisions.
Metadata -> information about data
Data warehouses: vast digital repositories where data is recorded and stored.
Data -> from the Latin for “givens” (plural; the singular is datum).
Data are stored in a table -> data table.
Relational database -> two or more separate data tables are linked together so that information can be merged across them.
Cases (who?) -> rows of a data table.
Variables (what?) -> columns of a data table.
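As a rough illustration of how linked data tables work, the sketch below stores two small tables as Python lists of dictionaries and merges them on a shared identifier. The tables, the column names (customer_id, name, order_id, amount) and all values are hypothetical and only serve to show the idea.

```python
# Hypothetical data table 1: one row (case) per customer.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]

# Hypothetical data table 2: one row (case) per order.
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
    {"order_id": 12, "customer_id": 1, "amount": 15.0},
]

# Index the customers by their identifier so each order can look up its customer.
customers_by_id = {row["customer_id"]: row for row in customers}

# Merge: add the customer's name to every order row.
merged = [
    {**order, "name": customers_by_id[order["customer_id"]]["name"]}
    for order in orders
]

for row in merged:
    print(row)
```

Each row of the merged table is still one case (an order), but it now carries a variable (name) that was originally stored in the other table.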
2. Variable types
Categorical: values are names of categories.
Nominal -> just a label without a particular order. Some nominal variables are used as identifiers.
Ordinal -> labels with a given order. Example -> steak: rare, medium, well done.
Identifier -> assigns a unique code to each individual. A special case of nominal categorical variables.
Quantitative: values are numerical quantities with units -> variables that record measurements or amounts of something. The values must be numbers and must have units.
Cross-sectional data -> values recorded at a single point in time for different units. Example -> Starbucks locations at the end of 2018.
Time series data -> information collected for one subject at different points in time -> if there is displacing, then there is no time series. Example -> number of Starbucks locations in the world for each year.
Statistics module 2: Displaying and describing categorical Data
3. Displaying and describing categorical data introduction
Descriptive statistics: summarizes data and displays it.
In statistics, instead of measuring the whole population, it is usually more practical to take a sample from the population and make inferences about the population based on that sample.
Inferential statistics: making inferences about a population based on a (randomly chosen) sample.
Display and summarize data to see: patterns, relationships, exceptions in values and observations.
4. Summarizing a categorical variable
IP number (categorical, identifier)   Time (quantitative)       Source (categorical, nominal)
245.240.221.71                        1/feb/2013 at 13:15:08    Google
196.345.281.51                        1/feb/2013 at 14:56:23    Direct
Frequency table: records the counts for each of the categories of a categorical variable. It shows how often each category occurs.
Absolute frequency -> Count of the number of cases in each category.
Relative frequency -> count of the number of cases in each category divided by the total number of cases.
Source   Absolute frequency (count)   Relative frequency (%)
Google   130158                       57.36
Direct   52969                        23.34
….       ….                           ….
Total    226925                       100
Express frequency as a percentage: Compute the proportion: 130158/226925 ≈ 0.5736
Percentage = proportion x 100% -> 0.5736 x 100% = 57.36%
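The calculation above can be sketched in a few lines of Python. The function name relative_frequency is just an illustrative choice; the counts and the total are the ones given in the frequency table (the other sources are elided in the notes, so they are left out here).

```python
def relative_frequency(count, total):
    """Return the relative frequency of a category as a percentage."""
    return count / total * 100

total = 226925  # total number of cases, taken from the table
counts = {"Google": 130158, "Direct": 52969}  # only the rows listed in the notes

for source, count in counts.items():
    print(f"{source}: {relative_frequency(count, total):.2f}%")
```

Dividing by the total from the table (rather than by the sum of the two listed counts) matters, because the listed rows are only part of the data.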
5. Displaying a frequency table
Bar chart -> displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison. Keep the bars proportional to the counts, leave gaps between the bars to indicate separate categories, and label the axes.
Pie chart: shows how the whole group breaks into several categories, displaying all the cases as a circle sliced into pieces whose areas are proportional to the fraction of cases in each category.
A pie chart gives less information and makes it hard to see what the story is. Bar charts are better for representing frequency tables.
6. Two categorical variables
Example: survey of 5039 people in 5 countries: “Do you use social networking sites?”
Data table:
Respondent ID (identifier)   Social networking (categorical)   Country (categorical)
0001                         Yes                               Egypt
0002                         No access                         Egypt
….                           ….                                ….
5039                         Yes                               US
Frequency table:
Social networking   Count   Relative frequency
No                  1249    24.79%
Yes                 2175    43.16%
No access           1615    32.05%
Total               5039    100%
7. Contingency table
Lists the counts for every combination of values of the two categorical variables. It shows how individuals are distributed along each variable depending on the value of the other variable.
            GB     EG     DE     RU     US     Total
No          336    70     460    90     293    1249
Yes         529    300    340    500    506    2175
No access   153    630    200    420    212    1615
Total       1018   1000   1000   1010   1011   5039
Marginal distribution: the frequency distribution of either one of the variables on its own (the totals in the margins of the table).
Each cell of the contingency table gives the count for a combination of values of the 2 variables.
For every cell, we can compute three percentages. For the cell Yes/EG (count 300):
Total percentage = 300/5039 x 100% ≈ 6.0% -> 6.0% of all respondents are from Egypt and answered Yes.
Row percentage = 300/2175 x 100% ≈ 13.8% -> 13.8% of the respondents who answered Yes are from Egypt.
Column percentage = 300/1000 x 100% = 30.0% -> 30.0% of the respondents from Egypt answered Yes.
Compare social networking country to country -> use column percentages.
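The three percentages can be computed for any cell once the marginal totals are known. The sketch below rebuilds the margins from the cell counts of the contingency table; the function name cell_percentages is hypothetical, and the GB/No count is taken as 336, which is the value implied by the stated row total (1249) and column total (1018).

```python
countries = ["GB", "EG", "DE", "RU", "US"]

# Cell counts per answer (rows) and country (columns).
# Assumption: GB/No is 336, the value consistent with the stated totals.
counts = {
    "No":        [336, 70, 460, 90, 293],
    "Yes":       [529, 300, 340, 500, 506],
    "No access": [153, 630, 200, 420, 212],
}

# Marginal distributions: totals per row and per column.
row_totals = {answer: sum(row) for answer, row in counts.items()}
col_totals = {
    country: sum(counts[answer][i] for answer in counts)
    for i, country in enumerate(countries)
}
grand_total = sum(row_totals.values())

def cell_percentages(answer, country):
    """Return the total, row and column percentage for one cell."""
    cell = counts[answer][countries.index(country)]
    return {
        "total": 100 * cell / grand_total,
        "row": 100 * cell / row_totals[answer],
        "column": 100 * cell / col_totals[country],
    }

pct = cell_percentages("Yes", "EG")
print({name: round(value, 1) for name, value in pct.items()})
```

Column percentages divide by the country total, which is why they are the natural choice for comparing countries with different sample sizes.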
Statistics Module 3: Displaying and describing quantitative data
1. Displaying quantitative variables
Quantities you can count: discrete
Quantities that can be measured: continuous
E.g. Monthly Stock price ($), AIG 2002-2007.
Month AIG stock price
Jan 2002 $77.26
Feb 2002 $72.95
Mar 2002 $73.72
…
Nov 2007 $56.86
Dec 2007 $58.13
How are the values of price distributed? -> construct a frequency table
To construct a frequency table:
1) Sort the values from small to large
2) Decide which bins you will use: e.g. -> $45-$50, $50-$55, $55-$60, $60-$65, $65-$70, $70-$75, $75-$80
Values that fall on a boundary go into the higher bin -> e.g. bin $45-$50 includes $45 but not $50: left closed, right open: [$45, $50[
3) Count the number of cases that fall into each bin:
AIG stock price   Count (absolute frequency)   Relative frequency   Density (per $)
$45-$50           2                            2.8%                 0.0056
$50-$55           3                            4.2%                 0.0084
$55-$60           13                           18.1%                0.0362
$60-$65           16                           22.2%                0.0444
$65-$70           24                           33.3%                0.0667
$70-$75           13                           18.1%                0.0362
$75-$80           1                            1.4%                 0.0028
Total             72                           100.1%               /
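The counting step can be sketched as follows, assuming equal-width bins that are left closed and right open, as described above. The function name bin_counts is an illustrative choice, and the example uses only the five prices that are listed explicitly in the notes, not the full set of 72 values.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall into each left-closed, right-open bin."""
    counts = [0] * n_bins
    for v in values:
        # Floor division puts a boundary value (e.g. 50) into the higher bin.
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# The five monthly prices listed in the notes (the rest are elided there).
prices = [77.26, 72.95, 73.72, 56.86, 58.13]

# Bins $45-$50, $50-$55, ..., $75-$80.
counts = bin_counts(prices, start=45, width=5, n_bins=7)
print(counts)
```

Sorting the values first, as in step 1, is useful when counting by hand; the function does not need it.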
Before making a histogram, check the quantitative data condition: the data must be values of a quantitative variable whose units are known. A bar chart and a histogram look similar, but they are not the same. You can’t display categorical data in a histogram; histograms don’t have gaps between the bars, bar charts do.
Bar chart of a frequency table of quantitative data: histogram.
Bar chart of a relative frequency table of quantitative data: relative frequency histogram.
When you look at a histogram, look for three characteristics:
1) Shape: symmetry vs skew, bumps and valleys, gaps
2) Center
3) Spread
2. Data density
Density histogram: the area of each bar represents the relative frequency of its bin.
Area = height x width of bin = relative frequency
Height = relative frequency / width of bin
E.g. for the $55-$60 bin: area = 18.1%, so height = 18.1%/$5 = 3.62% per $ -> in the $55-$60 bin, every interval $1 wide contains about 3.62% of the values.
Density is usually expressed as a decimal fraction per horizontal unit: 3.62%/$ = 3.62/100 per $ = 0.0362/$
[Figure: density histogram]
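A minimal sketch of the density calculation for a single bin, using the $55-$60 bin from the notes (relative frequency 18.1%, bin width $5). The function name density is an illustrative choice.

```python
def density(relative_frequency, bin_width):
    """Height of a density-histogram bar: relative frequency per horizontal unit."""
    return relative_frequency / bin_width

rel_freq = 0.181  # 18.1% of the values fall in the $55-$60 bin
width = 5         # the bin is $5 wide

height = density(rel_freq, width)
area = height * width  # multiplying back gives the relative frequency again

print(round(height, 4))  # density per $
print(round(area, 3))    # area of the bar
```

Because area equals relative frequency, the areas of all bars in a density histogram add up to 1 (or 100%), whatever bin widths are used.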
3. Shape
You should pay attention to 3 things: shape, center and spread.
Mode:
o A single mode (e.g. the bin $65-$70): unimodal distribution -> the IQR is usually larger than the standard deviation; if it is not, check whether the distribution is skewed or multimodal.
o two modes: bimodal distribution
o no clear modes: uniform distribution
o three or more modes: multimodal.
Symmetry: the distribution is symmetric when you can fold the histogram in the middle and the 2 sides almost match (mirror images).
o Skewed to the left: left side has longer “tail”
o Skewed to the right: right side has longer “tail”
Outliers: values that stick out -> they tell us something interesting about the data and can be the most informative part of your data.
Centre: What is the typical stock price? -> about $65 -> if we want a more precise number -> calculate the average (mean).
Mean (average) = sum of all values / number of values
The mean is sensitive to skewness. If the distribution is right skewed, the mean will typically be larger than the median; if it is left skewed, the mean will typically be smaller than the median.
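The effect of skewness on the mean can be checked with a small made-up example. The two samples below are hypothetical and were chosen only so that one has a long right tail and the other a long left tail; mean and median come from Python's standard statistics module.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are small, one is very large.
right_skewed = [20, 22, 25, 27, 30, 150]

# Mirror image of the sample above, so the long tail is on the left.
left_skewed = [170 - v for v in right_skewed]

print("right skewed:", mean(right_skewed), median(right_skewed))
print("left skewed: ", mean(left_skewed), median(left_skewed))
```

In both samples the single extreme value pulls the mean toward the tail, while the median stays near the bulk of the values.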
Adolphe Quetelet (1796-1874): popularized the use of the average with his concept of the “average man”.
Median = the value that splits the histogram into two equal areas. It is the better choice for variables like cost or income, which are likely to be skewed, because the median is resistant to unusual observations (outliers).
1) Order the values