INTRODUCTION
All statistical methods require data. Data is the facts about the world that one seeks to study
and explore.
Data may be summarized or unsummarized [raw].
WHAT IS STATISTICS
Statistics is the collection of methods that allow one to work w. data effectively.
Stats is a tool to obtain information from data. It provides us w. a formal basis to summarize and
visualize data, reach conclusions about the data, make reliable predictions about business
activities and improve business processes.
DCOVA framework
Define data you want to study to meet an objective.
Collect the data from appropriate sources.
Organize data collected by developing tables.
Visualize data by developing charts.
Analyze data collected, reach conclusions and present results.
BUSINESS ANALYTICS
Combines statistical methods w. management science and information systems to form an
interdisciplinary tool that supports fact-based decision making.
DATA SCIENCE
The field of study that combines domain expertise, programming skills and knowledge of
mathematics and statistics to extract meaningful insights from data.
BIG DATA
A collection of data that cannot be easily browsed or analyzed using traditional methods.
It is data collected in huge volumes, at very fast rates [real time] and in a variety of forms.
It may refer to large data sets of structured data stored in files / worksheets. It may also be
unstructured, such that the data has an irregular pattern and contains values that are not
comprehensible without further interpretation [unstructured data could be text, pictures,
videos or audio].
DEFINITIONS
Descriptive statistics: methods of organizing, summarizing, and presenting data in an
informative and convenient way.
Inferential statistics: methods used to make a conclusion about a characteristic of a
population, based on a smaller sample of the population.
Variable: characteristic / property of an item that can vary among the occurrences of those
items. [Note: each value for a variable is a single fact, not a list of facts].
Data: set of values associated w. one / more variables.
Statistics: methods that analyze the data of the variables of interest.
CLASSIFYING VARIABLES BY TYPE
Categorical [qualitative] variables:
- Take categories as their values [e.g. “yes” / “no”].
Numerical [quantitative] variables:
- Have values that represent a counted / measured quantity.
o Discrete variables arise from a counting process. Values are countable over a
finite range.
o Continuous variables arise from a measuring process. Values are uncountable
over a finite range.
MEASUREMENT SCALES
Nominal scale – classifies categorical data into distinct categories in which no ranking is
implied.
Ordinal scale – classifies categorical data into distinct categories in which ranking is implied.
Numerical variables use an interval scale or ratio scale:
- Interval scale: ordered scale in which the difference btwn. measurements is a
meaningful quantity but the measurements do not have a true zero point.
- Ratio scale: ordered scale in which the difference btwn. measurements is a
meaningful quantity and the measurements have a true zero point.
Variables
- Categorical:
o Ordinal: e.g. ratings; Good, Better, Best [ordered categories].
o Nominal: e.g. marital status / eye colour [defined categories].
- Numerical:
o Discrete: e.g. number of children / defects per hour [counted items].
o Continuous: e.g. weight / time [measured characteristics].
POPULATION VS SAMPLE
Data is collected from a population / sample.
Population:
- Contains all items / individuals of interest that you seek to study / about which you
want to reach conclusions.
Sample:
- Contains only a portion of a population of interest.
- Used because:
o Less time consuming than selecting every item in population.
o Less costly than selecting every item in population.
o Less cumbersome and more practical than analyzing entire population.
- Analyzed to estimate characteristics of an entire pop.
o Population parameter summarizes the value of a specific variable for a
population.
o Sample statistic summarizes the value of a specific variable for sample data.
o Sample statistics are used to estimate population parameters.
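The difference btwn. a parameter and a statistic can be sketched in Python. This is a minimal illustration only: the population of invoice values, the sample size of 50 and all variable names are made up for the example and do not come from these notes.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: values (in rand) of 1 000 invoices.
population = [random.randint(10, 500) for _ in range(1000)]

# Population parameter: summarizes the variable for the whole population.
population_mean = statistics.mean(population)

# Sample statistic: summarizes the same variable for a sample only.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```

The sample mean is used as an estimate of the population mean; the two will usually be close but not equal.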
OBSERVATIONAL STUDIES AND DESIGNED EXPERIMENTS
Have a common objective.
- Both attempt to quantify the effect that a process change [called a treatment] has
on a variable of interest.
In an observational study, there is no direct control over which items receive treatment.
In a designed experiment, there is direct control over which items receive treatment.
SOURCES OF DATA
Primary sources:
- Data collector is one using data for analysis:
o Data from political survey.
o Data collected from experiment.
o Observed data.
Secondary sources:
- Person performing data analysis is not data collector:
o Analyzing census data.
o Examining data from print journals / data published on internet.
SOURCES OF DATA ARISE FROM
Capturing data generated by ongoing business activities.
Distributing data compiled by an organization / individual.
Compiling the responses from a survey.
Conducting a designed experiment and recording the outcomes.
Conducting an observational study and recording results.
SAMPLING PROCESS
Begins w. sampling frame:
- Sampling frame is a listing of items that make up pop.
- Frames are data sources.
- Inaccurate / biased results can result if frame excludes certain groups / portions of
pop.
- Using different frames to generate data can lead to dissimilar conclusions.
TYPES OF SAMPLES
Samples
- Non Probability Samples:
o Judgement
o Convenience
- Probability Samples:
o Simple Random
o Systematic
o Stratified
o Cluster
Non Probability sample
Items included are chosen without regard to their probability of occurrence.
- In Convenience sampling, items are selected based only on fact they are easy,
inexpensive or convenient to sample.
- In Judgement sample, get opinions of pre-selected experts on subject matter.
Probability sample
Items in sample are chosen on basis of known probabilities.
Simple Random sample
- Every individual / item from frame has equal chance of being selected.
- Selection may be w. replacement [selected individual is returned to frame for
possible reselection] or w. out replacement [selected individual is not returned to
frame].
- Samples obtained from table of random numbers / computer random number
generators.
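A minimal Python sketch of the two selection modes, using the standard library random number generator. The frame of 20 numbered items and the sample size of 5 are assumptions made for the illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

frame = list(range(1, 21))  # hypothetical frame: items numbered 1 to 20
n = 5                       # sample size

# Without replacement: a selected item is not returned to the frame,
# so no item can appear twice.
without_replacement = random.sample(frame, n)

# With replacement: a selected item is returned to the frame,
# so the same item may be selected again.
with_replacement = random.choices(frame, k=n)

print("Without replacement:", without_replacement)
print("With replacement:   ", with_replacement)
```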
Systematic sample
- Decide on sample size : n.
- Divide frame of N individuals into groups of k individuals : k = N / n.
- Randomly select one individual from 1st group, i.e. choose a starting number btwn 1 and k.
- Select every kth individual thereafter [e.g. if you choose 2 and k=10 then it will be 2,
12, 22, etc.].
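The steps above can be sketched in Python. The function name systematic_sample, the frame of 100 numbered items and the rounding down of k when N is not a multiple of n are assumptions of this sketch, not part of the notes.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Return a systematic sample of n items from frame."""
    N = len(frame)
    k = N // n                  # group size, k = N / n (rounded down)
    start = rng.randint(1, k)   # random position between 1 and k in the 1st group
    # Take the item at position `start`, then every kth item thereafter.
    return frame[start - 1::k][:n]

frame = list(range(1, 101))     # hypothetical frame: items numbered 1 to 100
sample = systematic_sample(frame, n=10, rng=random.Random(0))
print(sample)
```

Only the starting position is random; once it is chosen, the rest of the sample is fixed by k.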
Stratified sample
- Divide pop into two / more subgroups [called strata] according to some common
characteristic.
- Simple random sample is selected from each subgroup, w. sample sizes proportional
to strata sizes.
- Samples from subgroups are combined into one.
- This is a common technique when sampling a population of voters, stratifying across
provincial / socio-economic lines.
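A minimal Python sketch of stratified sampling w. proportional allocation. The three provinces, their sizes and the function name stratified_sample are hypothetical; rounding each stratum's share to a whole number is a simplification, so the combined size can differ slightly from n when the shares are not whole numbers.

```python
import random

def stratified_sample(strata, n, rng):
    """Simple random sample from each stratum, sized in proportion to the stratum."""
    N = sum(len(items) for items in strata.values())
    combined = []
    for items in strata.values():
        # Proportional allocation, rounded to the nearest whole item.
        n_stratum = round(n * len(items) / N)
        combined.extend(rng.sample(items, n_stratum))
    return combined

# Hypothetical population of 100 voters split into three strata.
strata = {
    "province_A": [f"A{i}" for i in range(50)],
    "province_B": [f"B{i}" for i in range(30)],
    "province_C": [f"C{i}" for i in range(20)],
}
sample = stratified_sample(strata, n=10, rng=random.Random(7))
print(sample)
```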
Cluster sample
- Population is divided into several “clusters”, each representative of the pop.
- Simple random sample of clusters is selected.
- All items in selected clusters can be used, or items can be chosen from a cluster using
another probability sampling technique.
- Common application of cluster sampling involves election exit polls, where certain
election districts are selected and sampled.
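A minimal Python sketch of cluster sampling, showing both options for the selected clusters. The six districts of four voters each, and the choice of two clusters, are made up for the illustration.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical population: 6 election districts [clusters] of 4 voters each.
clusters = {
    f"district_{d}": [f"d{d}_voter{v}" for v in range(1, 5)]
    for d in range(1, 7)
}

# Step 1: simple random sample of clusters.
chosen = rng.sample(sorted(clusters), 2)

# Step 2, option a: use all items in the selected clusters.
sample_all = [voter for name in chosen for voter in clusters[name]]

# Step 2, option b: simple random sample of items within each selected cluster.
sample_some = [voter for name in chosen for voter in rng.sample(clusters[name], 2)]

print(chosen)
print(sample_all)
print(sample_some)
```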
Comparing sampling methods
- Simple random sample and systematic sample:
o Simple to use.
o May not be good representation of population's underlying characteristics.
- Stratified sample:
o Ensures representation of individuals across entire population.
- Cluster sample:
o More cost effective.
o Less efficient [need larger sample to acquire the same level of precision].
Selection w. probability proportionate to size
- In a simple random sample, elements of the population are selected without the monetary
value on the invoice playing a role [e.g. if we consider sales, an invoice w. value R10 has
the same probability of being selected as an invoice w. value R100].
- If correctness of monetary value must be verified, the magnitude of monetary value
becomes important.
- In such a case a selection process that takes the magnitude of monetary values on
each invoice into account is preferred.
- Refer to this type of selection process as selection w. probability proportional to size
[PPS], where size refers to monetary value on each invoice.
- Suppose several invoices must be selected from N invoices via PPS selection process.
- Let T denote the total monetary value of the N invoices.
- According to PPS selection process, each of T rand units has an equal probability of
being selected.
- This implies invoice w. R4 000 entry has a probability of selection four times as large as
selection probability of an invoice w. R1 000 entry.
- In this case, we deal w. two types of elements [invoices and rand units].
- W. PPS selection an invoice is selected in an indirect manner, because a rand unit is
selected first and then the invoice on which it occurs is selected.
- Note that each rand unit has same chance of selection, but chance of selection for
each invoice is proportionate to number of rand units that appear on it.
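The indirect selection described above [rand unit first, then the invoice on which it occurs] can be sketched in Python. This is one possible way to do it, using running totals. The invoice values, the function name pps_select and the assumption that values are whole rand amounts are all illustrative.

```python
import bisect
import itertools
import random

def pps_select(invoice_values, rng):
    """Select one invoice index w. probability proportional to its rand value."""
    # Running totals: invoice i covers rand units cumulative[i-1] + 1 .. cumulative[i].
    cumulative = list(itertools.accumulate(invoice_values))
    T = cumulative[-1]              # total monetary value of the N invoices
    rand_unit = rng.randint(1, T)   # each of the T rand units is equally likely
    # Find the invoice on which the selected rand unit occurs.
    return bisect.bisect_left(cumulative, rand_unit)

values = [1000, 4000, 2500, 500]    # hypothetical invoice values in rand
rng = random.Random(11)
counts = [0] * len(values)
for _ in range(8000):
    counts[pps_select(values, rng)] += 1
print(counts)
```

Over many repeated selections, the R4 000 invoice should be selected roughly four times as often as the R1 000 invoice.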
Example