Statistics
Class 1: Population and Sample, Data Collection, Variables and Measurement Scale
Statistics
Population and Sample
Data collection
Variables and measurement scale
Class 2: Descriptive statistics: Exploratory Data Analysis, Charts, Tables, Measures of Central Tendency and Dispersion, Spatial Data and Time Series
Descriptive Statistics / Exploratory Data Analysis (h2 & h3)
Charts
Normality
Tables
Measures of Central Tendency
Measures of Dispersion
Time Series Data and Spatial Data
Class 3 & 4: Introduction to SPSS, Probability, Z-score, Central Limit Theorem, Inferential Statistics
Introduction to SPSS
Inferential Statistics (Ch. 5.3-5.4)
Single sample Z-test: introduction to the example (Ch. 8.3)
Classical Hypothesis Testing (Ch. 8.1, 8.2)
Prob-value/P-value
Point estimation (Ch. 7.1, 7.2, 8.4)
Sample Statistics and Sampling Distributions: Z-score and Z-test (Ch. 5.4, 6.5)
Interval estimation (Ch. 7.1, 7.3, 8.4)
Single sample Z-test: Summary and full example in short (Ch. 8.3)
Class 5: Single sample T-test, Normality Test, Significance Vs Relevance
T-tests and the T-distribution: introduction (Ch. 8.3)
Determining Normality (Ch. 10.4)
Single Sample (T-Test) (Ch. 8.3)
Significance and relevance (Ch. 8.5)
Class 6: Paired Samples T-test, Non-parametric testing, Sign test, Wilcoxon-Signed-Rank-Test
Paired Samples T-test: introduction (Ch. 9.2)
Repeat: Determining Normality (Ch. 10.4)
Paired-Samples T-test: Results
Nonparametric Methods (Ch. 10.1, 10.2)
Sign Test (Ch. 10.2)
Wilcoxon-Signed-Rank Test (Notes)
Class 7: Two-samples T-test, Levene’s test, Mann-Whitney test, Two-samples Number of Runs test
Two-Samples T-test: introduction (Ch. 9.1)
Levene’s Test (Ch. 9.4)
Back to Two-sample T-test: Result
Mann Whitney Test (Ch. 10.2)
Two-Samples number of runs test (Ch. 10.2)
Which two samples test?
Class 8: Inference for proportions: Binomial test, Difference-of-Proportions test
Binomial distribution (Ch. 5.3)
Binomial Test (Ch. 8.3)
Difference of proportions Test (Ch. 9.3): do this by hand
Class 1: Population and Sample, Data Collection, Variables and Measurement Scale
Statistics
Statistics is the method for collecting, managing, presenting, interpreting, publishing and
analyzing numerical data. It is the methodology used in studies that collect, organize, and
summarize data through graphical and numerical methods, analyze the data, and ultimately draw
conclusions.
Descriptive statistics: Deals with the organization and summary of data. The purpose is to replace
what may be an extremely large set of numbers in some dataset with a smaller number of summary
measures. When this replacement is made, there is inevitably some loss of information. It is
impossible to retain all of the information in a dataset using a smaller set of numbers. The goal of
descriptive statistics is to minimize the effect of this information loss. Another important goal is
understanding which statistical measure should be used as a summary index in a particular case.
Inferential statistics: Descriptive statistics is linked with probability theory so that an investigator can
generalize the results of a study of a few individuals to some larger group.
Population and Sample
Population
Population: The group/the total set of elements that you wish to describe. That can be anything:
people, firms, municipalities, countries, objects, regions, neighborhoods, rivers, etc. If a geographer
is studying farm practices in a particular region, the relevant population consists of all farms in the
region on a certain date or within a certain period. The population under consideration can be finite
or hypothetical.
Population characteristic: Any measurable attribute of an element in the population. Usually, we
are interested in one or more characteristics of the population.
Variable: A variable is a population characteristic that takes on different values for the elements
comprising the population. This is what makes the process of statistical inference necessary.
Population census/enumeration: A way to collect information about a population. It is a complete
tabulation of the relevant population characteristic for all elements in the population. It is a feasible
alternative only for finite populations.
Besides a population census/enumeration, sampling is another way to obtain information about a
population.
Sample
Sample: The group for which you have data. A subset of elements from the population, taken with
the intention of making inferences about certain characteristics of the population as a whole.
Thus, in sampling we obtain values for only selected members of a population.
Inferences: you have a population from which you draw a sample. Based on that sample, you make
statements about the population.
Most of the time we take a sample for practical reasons. Describing the whole population may be
too expensive (in time and money) or impossible (the population is too big to find or talk to everybody),
and measuring every element might be destructive, impractical or unnecessary. It is unnecessary because
statistics comes with a number of proofs that allow us to use a sample to make statements about a
population that we have not seen in its entirety.
Sampling Error: Working with samples rather than the full population has the disadvantage that
restricting our attention to a small proportion of the population makes it impossible to be as
accurate about population characteristics as is possible with a complete census. We have to accept
that there is a degree of uncertainty concerning the match between sample statistics and population
parameters, so the risk of making errors is increased. Sampling error is the uncertainty that
arises from working with a sample rather than the entire population. It is the difference between the
value of a parameter (population characteristic) and the value of the statistic computed from a
sample to estimate that parameter. For a mean, it is the difference between the population mean and
the sample mean, which we want to minimize.
Example: Consider the population characteristic of the average selling price of homes in a given metropolitan
area in a certain year. If each and every house is examined, the average selling price is 150,000. However, if
only 25 homes per month are sampled, the average selling price of the 300 homes in the sample may be
120,000. The difference of 30,000 is due to sampling error.
Sampling error is the sum of three components: variability, sampling bias and nonsampling error.
Variability: The phenomenon whereby repeated sampling from the same population results in
different values for the statistic. If you draw simple random samples of the same size from the
same population, the sample statistic will have a (slightly) different value for each sample, and
there will always be some difference between the true population value and the value you get from
your sample. This difference is the result of chance: the sampling outcome is a random variable
because the selection of cases is random. Larger samples help reduce variability.
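The effect of variability can be made visible with a small simulation. The sketch below is only an illustration: the population of house prices, its mean of 150,000, its spread and the sample sizes are all invented assumptions, not data from the course.

```python
import random
import statistics


def sample_means(population, n, repeats, rng):
    """Draw `repeats` simple random samples of size n and return their means."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 house prices (invented for illustration).
population = [rng.gauss(150_000, 30_000) for _ in range(10_000)]

small = sample_means(population, n=25, repeats=200, rng=rng)
large = sample_means(population, n=400, repeats=200, rng=rng)

# The spread of the sample means measures variability for each sample size.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"population mean:           {statistics.fmean(population):,.0f}")
print(f"spread of means, n = 25:   {spread_small:,.0f}")
print(f"spread of means, n = 400:  {spread_large:,.0f}")
```

Every sample gives a slightly different mean, and the means of the larger samples lie closer together than those of the smaller samples.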
Sampling distribution: We know how variability behaves: it results in a sampling distribution.
The sampling distribution describes the behavior of a statistic, that is, how the statistic varies when
sampling is repeated. It describes how likely it is that you will find a value that is bigger or
smaller than before, the (extent of) variability. This is the basis for statistical inference. You can still
draw conclusions from one sample if you understand how the sample statistic behaves.
Central Limit Theorem: Variability is linked to the Central Limit Theorem. Even if a variable is not
normally distributed in the population, we may assume that under certain conditions, such as a
large number of cases $n$ and a fixed standard deviation $\sigma$, the sampling distribution of the mean is
approximately normal with standard error $\sigma / \sqrt{n}$. Because we understand how our sample statistics
behave, that is, what their distribution is, we can rely on one sample to make inferences about the
population.
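A minimal sketch of the Central Limit Theorem at work, assuming a hypothetical right-skewed population (exponential with mean 1) and a sample size of n = 100; both choices are illustrative assumptions. It compares the observed spread of many sample means with the standard error σ/√n, and checks how many sample means fall within 1.96 standard errors of the population mean, which should be close to 95% if the sampling distribution is approximately normal.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical, strongly right-skewed population (not normally distributed).
population = [rng.expovariate(1.0) for _ in range(50_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 100          # cases per sample
repeats = 2_000  # number of repeated samples

# Independent draws (with replacement) so each case is selected independently.
means = [statistics.fmean(rng.choices(population, k=n)) for _ in range(repeats)]

observed_se = statistics.stdev(means)  # spread of the sample means
expected_se = sigma / math.sqrt(n)     # standard error from the formula

# Share of sample means within 1.96 standard errors of the population mean.
within = sum(abs(m - mu) <= 1.96 * expected_se for m in means) / repeats

print(f"observed standard error: {observed_se:.4f}")
print(f"sigma / sqrt(n):         {expected_se:.4f}")
print(f"share within 1.96 SE:    {within:.3f}")
```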
Sampling Bias: Occurs when the procedures used to select the sample tend to favor the inclusion of
individuals in the population with certain characteristics. As a result of a faulty sampling approach,
the research topic, the way respondents are approached, the questions asked, etc., some groups are
overrepresented and others are underrepresented (for example due to nonresponse) in your sample.
This introduces bias in the results: the sample is no longer representative of the larger population.
Drawing a larger sample does not mend this situation; a larger sample only increases the probability
of hitting upon a representative subset, a true cross section of the population, when the selection
procedure itself is unbiased. Instead, think through the sampling design, your positionality and access
to the population, the phrasing of questions, etc. For example, in research on university student study
habits, a sample would be biased if it were selected on the basis of interviews with students leaving the
university library late in the evening. Sampling bias can be minimized by selecting an appropriate
sampling plan.
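The library example can be sketched in a few lines. All numbers here are invented for illustration, including the assumption that only students who study more than 25 hours per week are found leaving the library late. The point of the sketch is that the biased sample stays far from the population mean no matter how large it is, while a random sample gets closer as n grows.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical population: weekly study hours of 20,000 students.
students = [max(0.0, rng.gauss(20, 8)) for _ in range(20_000)]
true_mean = statistics.fmean(students)

# Assumption for illustration: only heavy studiers leave the library late.
late_library = [hours for hours in students if hours > 25]

results = {}
for n in (50, 500, 2_000):
    random_mean = statistics.fmean(rng.sample(students, n))      # random sample
    biased_mean = statistics.fmean(rng.sample(late_library, n))  # library sample
    results[n] = (random_mean, biased_mean)
    print(f"n = {n:>5}: random {random_mean:5.1f}   biased {biased_mean:5.1f}"
          f"   (population {true_mean:.1f})")
```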
Nonsampling Error: Errors that arise in the acquisition, recording and editing of statistical data. Lack
of recall, ignorance, or a less than candid respondent can all play a role here. It manifests itself
predominantly at the level of variables. Minimizing it involves thinking about measurement validity
(do you measure the concept you intend to measure?), measurement accuracy (the absence of error, or
the degree of agreement between measurement and true value, which does not imply validity) and
measurement precision (the range of values possible in the measurement process, or their spread).
Reducing Sampling Error
- Variability: increase n.
- Sampling bias: think through the sampling design, the positionality and access to the population,
phrasing of questions, etc.
- Nonsampling error: validity, accuracy and precision of variables; prevent coding errors; prevent
interpretation errors; also: good labelling and metadata.
The link between the sample and the population is probability theory. There is no way of knowing
how well a sample reflects the population. So, instead of selecting a representative sample, we select
a random sample. Basing statistical inferences on random samples ensures unbiased findings. It is
hard to obtain a very unrepresentative random sample if the sample is large enough. Because the
sample has been randomly chosen, we can always determine the probability that the inferences
made from the sample are misleading. This is why statisticians always make probabilistic judgments,
never deterministic ones.
Representative sample: One in which the characteristics of the sample closely match the
characteristics of the population as a whole.
Random sample: One in which every individual in the population has the same chance, or
probability, of being included in the sample.
The process of statistical inference
Members/units of the population are selected in the process of sampling. Together these units comprise
the sample. From this sample, inferences about the population are made. In short, sampling takes us
from the population to a sample, and statistical inference takes us from the sample back to the
population. The aim of statistical inference is to make statements about a population characteristic
based on the information in a sample. There are two ways of making inferences: estimation and
hypothesis testing.
Statistical estimation: the use of the information in a sample to estimate the value of an unknown
population characteristic. An example is the use of political polls to estimate the proportion of voters in
favor of a certain party or candidate. Estimates are the statistician's best guess of the value of a
population characteristic. From a random sample of voters, we try to guess what proportion of all
voters will support a certain candidate.
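The polling example can be sketched as follows. The electorate of 100,000 voters and the true support of 40% are assumptions made up for the illustration; in a real poll the population proportion is of course unknown, which is why it has to be estimated.

```python
import random

rng = random.Random(7)  # fixed seed for reproducibility

# Hypothetical electorate: 1 = supports the candidate, 0 = does not.
voters = [1] * 40_000 + [0] * 60_000

# Population proportion: normally unknown, known here only because we built it.
parameter = sum(voters) / len(voters)

# Poll: a simple random sample of 1,000 voters.
poll = rng.sample(voters, 1_000)

# Sample proportion: our best guess of the population proportion.
estimate = sum(poll) / len(poll)

print(f"population proportion: {parameter:.3f}")
print(f"sample estimate:       {estimate:.3f}")
print(f"sampling error:        {estimate - parameter:+.3f}")
```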
Hypothesis testing: a procedure of statistical inference in which we decide whether the data in a
sample support a hypothesis that defines the value (or a range of values) of a certain population
characteristic. You hypothesize a value for some population characteristic and then determine the
degree of support for this hypothesized value from the data in the random sample.
Parameter or Statistic
Parameter: Numerical property of the population, for example a population mean, frequency or proportion.
Statistic: Numerical property of a sample, for example a sample mean, count or proportion.
Data collection
Sources of data
A properly executed research design will yield data that can be used to answer the questions of
concern in the study. We distinguish data that already exist in some form (archival) from data
that we propose to collect ourselves in the course of our research.
Data comes from different sources:
Archival: you already have the data
Internal data: Data available from existing records or files of the institution undertaking a study are
data from an internal source. The advantage of this kind of data is that the researcher knows a great
deal about the instruments used to collect the data, the accuracy of the data and possible errors.
External data: Data obtained from an organization external to the institution undertaking the study
are data from an external source. Many important characteristics of the data are unknown. Caution
should always be exercised in the use of external data. Not every source records all the relevant
information, so users of the data may have the false impression that no anomalies exist in the data.
To be collected: by yourself
Experimental: Some of the factors under consideration are controlled in order to isolate their effects
on the variable or variables of interest.
Survey/nonexperimental: No control is exercised over the factors that may affect the population
characteristic of interest. The five common survey methods are: observation (field study), personal
interview, telephone interview, mail questionnaire (self-enumeration: the individual completes the
questionnaire without assistance from the researcher) and web-based survey.
Sampling Bias
Sampling error: Uncertainty that arises from working with a sample rather than the entire
population. The appeal of statistics is not that it removes uncertainty, but rather that it permits
inference in the presence of uncertainty.
Another reason why the sample may not be representative of a population is sampling bias.
Sampling Bias: Occurs when the procedures used to select the sample tend to favour the inclusion of
elements from the population with certain population characteristics (certain people will have an
incentive to participate). It occurs when the way in which the sample was collected is itself biased.
Example: In research on university student study habits, a sample would be biased if it were selected on the
basis of interviews with students leaving the library late in the evening. Or, in research on public transport in
Toronto, only people who have something to complain about will take part in the survey and others won’t.
Sampling Bias occurs for a variety of reasons, usually in combination with each other:
- Population: some sections of the population may be very difficult to get in touch with.
- Researcher: preconceived ideas, laziness, the wrong person for the job.
- Research design: if you run a survey on access to the internet, you cannot use a web-based
survey, because you miss everyone who does not have access to the internet.
- Research topic: the topic may be sensitive or lead to strong opinions.
- Respondent: for whatever reason, a respondent may or may not take part in your research.
In other words, you need to prevent incomplete coverage (missing out on important sections of the
population) or nonresponse. Sampling Bias can usually be avoided or minimized by selecting an
appropriate sampling plan.
Steps in the sampling process
If a sample is the only feasible method of collecting the necessary data, the researcher must specify a
sampling plan. This is more of a circular process than a strictly linear one: later steps can feed back
into earlier ones.
1. Definition of the population: who/what do you want to describe.
2. Construction of a sampling frame: A sampling frame (or population frame) is an ordered list of the
individuals in a population. It has two key properties: (1) it must include all individuals in the
population, and (2) each individual element of the population must appear once and only once on the
list. In addition, it is useful to distinguish the target population from the sampled
population. The target population is the set of all individuals relevant to a particular study. The
sampled population consists of all the individuals listed in the sampling frame. It is desirable to
have the sampled and target populations as nearly identical as possible.
Example: If you use telephone surveys to evaluate voter preferences for political parties, the
population of interest is all eligible voters. The population actually sampled is composed of those residents
with telephones, or those who answer their phones. These two groups overlap a great deal, but are not
exactly the same.
3. Selection of a sampling design: You must decide how you are going to select individuals from the
sampling frame to include in the sample. This is specified in the sampling design. There are several
ways to select a sample. A random sample is an extremely useful design in statistical analysis. An
important characteristic of this type of sample is that we know the probability that each individual
in the population is included in the sample.
4. Specification of the information to be collected: This step can usually be accomplished at any
point prior to beginning data collection. The particular format used to collect data must be
rigorously defined and pretested by using a pilot sample. A pilot sample/pretest is an extended
test of the data collection procedures to be used in a study, carried out in advance of the main data
collection effort. In a field study, the pretest can be used to check instruments, data loggers and all
other logistics. For a survey (mail/telephone/interview), the pretest can sometimes reveal deficiencies
of any kind. Thinking ahead about the type of information you need for your research feeds
back into the definition of the population (step 1).
5. Collection of data: Once all the problems indicated in the pretest have been successfully solved,
the task of data collection can begin. At this stage, careful tabulation and editing are particularly
important if we wish to minimize nonsampling error.
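The two key properties of a sampling frame from step 2 (complete coverage, and each element listed once and only once) can be checked mechanically. The sketch below is a minimal illustration; the function name `check_frame` and the tiny lists of names are hypothetical and only serve the example.

```python
from collections import Counter


def check_frame(frame, target_population):
    """Compare a sampling frame with the target population.

    Returns elements listed more than once, target elements missing from
    the frame, and frame elements that are not in the target population.
    """
    counts = Counter(frame)
    duplicates = sorted(item for item, count in counts.items() if count > 1)
    missing = sorted(set(target_population) - set(frame))
    extra = sorted(set(frame) - set(target_population))
    return {"duplicates": duplicates, "missing": missing, "extra": extra}


# Hypothetical target population and an imperfect frame (invented names).
target = ["ann", "bob", "cem", "dia", "eli"]
frame = ["ann", "bob", "bob", "dia", "fay"]

report = check_frame(frame, target)
print(report)
```

An empty report on all three counts means the sampled population and the target population coincide.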
Types of samples (step 3: selection of a sampling design)
Sampling bias can be reduced by choosing an appropriate sampling design. Sampling designs can be
conveniently divided into two classes: probability samples and nonprobability samples.
Probability samples: one in which the probability of any individual member of the population being
picked for the sample can be determined. Examples: simple random, independent random,
systematic, stratified, cluster. The advantage of a probability sample is not that there is no
uncertainty in the results, but rather that we can assign a quantitative uncertainty value to the
results.
Cluster: a stepwise design. To minimize costs, you first randomly select a number of municipalities
from a given country, and then send out the questionnaires within the selected municipalities. The
municipalities are the clusters.
Stratified: an active choice on the part of the researcher to give certain groups a higher probability of
ending up in the sample. For example, young men often refuse to take part in questionnaires, so you send
more questionnaires to young men in the hope of getting enough responses from them.
Systematic: you take the decision out of your own hands by following a fixed rule, for example
selecting every k-th element from the sampling frame.
Independent random: you would have wanted a simple random sample, but the population is small,
so the cases that you select are not independent of the other cases.
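The probability designs listed above can be sketched with Python's standard library. This is only a rough illustration under stated assumptions: the function names, the frame of 100 numbered individuals, the two strata and the ten "municipalities" are all hypothetical, and the systematic design is implemented as "every k-th element after a random start".

```python
import random


def simple_random(frame, n, rng):
    """Simple random sample: n elements without replacement, all equally likely."""
    return rng.sample(frame, n)


def systematic(frame, n, rng):
    """Systematic sample: random start, then every k-th element of the frame."""
    k = len(frame) // n        # sampling interval
    start = rng.randrange(k)   # random start within the first interval
    return [frame[start + i * k] for i in range(n)]


def stratified(strata, n_per_stratum, rng):
    """Stratified sample: a separate simple random sample within each stratum."""
    return {name: rng.sample(members, n_per_stratum[name])
            for name, members in strata.items()}


def cluster(clusters, n_clusters, n_per_cluster, rng):
    """Cluster sample: first select clusters at random, then sample within them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}


rng = random.Random(3)        # fixed seed for reproducibility
frame = list(range(1, 101))   # hypothetical frame of 100 individuals

srs = simple_random(frame, 10, rng)
sys_sample = systematic(frame, 10, rng)

# Oversample a small, hard-to-reach group (here: the first 20 individuals).
strata = {"young men": frame[:20], "others": frame[20:]}
strat = stratified(strata, {"young men": 8, "others": 8}, rng)

# Ten hypothetical municipalities of ten individuals each.
municipalities = {f"municipality {i}": frame[i * 10:(i + 1) * 10] for i in range(10)}
clus = cluster(municipalities, n_clusters=3, n_per_cluster=4, rng=rng)

print(srs, sys_sample, strat, clus, sep="\n")
```

In the stratified call, "young men" make up 20% of the frame but half of the sample, which mirrors the idea of deliberately giving one group a higher probability of inclusion.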