Population = the set of objects under investigation. The objects themselves are the elements.
Data (measurements) = made on the elements; they reflect some individual characteristics of the
elements.
Sample = studies usually consider only a part of the population of interest (a sample). The data,
then, are measurements of the elements in the sample. These data contain hidden information that
has to be detected by statistics in order to become knowledge.
Data have to be collected first; then they have to be summarized in terms of informative
numbers. Often the measurements come from a sample.
Subdivision of statistics:
1. Descriptive statistics: includes the collecting of data, and summarizing and presenting
them by means of tables, graphs and distinctive numbers. Collecting data involves
observations that follow from experiments. Data must be summarized in a way that ensures
no important information is lost.
2. Probability theory: studies the behaviour and the laws of chance and probability in
experiments that allow more than one outcome. Precision of statistical procedures is
always expressed in terms of probabilities.
3. Sampling theory: studies methods of sampling and their properties. One method is
random sampling, where the elements of the population have the same chance of being
chosen in the sample.
4. Inferential statistics: studies and applies methods to draw conclusions about distinctive
numbers of the whole population of interest by considering only a sample.
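Random sampling as described in point 3 can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered elements, the sample size and the helper name simple_random_sample are assumptions, not part of the notes.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; each element has the same chance of being chosen."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(list(population), n)

# Hypothetical population of 100 numbered elements.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```

The sample is drawn without replacement, so no element can appear twice.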
Variables
- Characteristic = a feature of interest that is used to compare the elements.
- Population variable = a well-defined rule for observing a characteristic.
- Observation/observed value (or data) = the result of measuring a variable at an element.
- Qualitative (categorical) variable = a variable with categorized values.
o Nominal variable = if the values cannot be ordered in a natural way (e.g. gender)
o Ordinal variable = if the values can be ordered naturally (e.g. size)
- Quantitative (numerical) variable = a variable whose values are ordinary numbers. Can be:
o Discrete = if its set of possible values can be counted; or
o Continuous = if the set of possible values consists of all real numbers in an
interval (concerns large numbers).
- Alternative/dichotomous variable = a qualitative variable that can take only two values
(man or woman, for instance)
- Dummy variable = an alternative variable for which one of the two values is coded as 1 and
the other as 0
- Interval variable = a quantitative variable for which the ratio of two values is meaningless.
Otherwise it is a ratio variable
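Coding an alternative variable as a dummy variable could look like the sketch below. The observations and the helper name to_dummy are made up for illustration, and which of the two values gets the code 1 is an arbitrary choice.

```python
def to_dummy(values, coded_as_one):
    """Code a dichotomous variable: 1 for `coded_as_one`, 0 for the other value."""
    if len(set(values)) > 2:
        raise ValueError("an alternative variable can take only two values")
    return [1 if v == coded_as_one else 0 for v in values]

# Made-up observations of an alternative variable.
gender = ["woman", "man", "woman", "woman", "man"]
dummy = to_dummy(gender, coded_as_one="woman")
print(dummy)  # [1, 0, 1, 1, 0]
```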
Populations versus samples
- Census = when a variable is observed at all elements of the population
- Population dataset = the resulting dataset from a census; it contains all possible info
- Sample = a subset of a population
- Sample dataset = the resulting dataset from a sample
- Sample statistic (or simply statistic) = a summary number computed, for a certain variable,
from a sample dataset
- Population statistic = a summary number computed from a population dataset
- Parameter = a number or other measurable factor forming one of a set that defines a
system
- A statistic measures some overall feature of a set of objects (a fixed number)
- A variable measures some individual feature that can take different values for different
individual objects
Chapter 2: Tables and graphs
Nominal variables
- (Absolute) frequency = the number of times that a certain value occurs in the dataset
- Relative frequency = the frequency of a value divided by the total number of observations,
i.e. the proportion of all observations in the dataset with that value (times 100 when
expressed as a percentage)
- Frequency distribution = overview of all different values in the dataset jointly with
accompanying frequencies
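A frequency distribution for a nominal variable can be computed as in this minimal sketch. The colour data and the helper name frequency_distribution are illustrative assumptions.

```python
from collections import Counter

def frequency_distribution(data):
    """Map each distinct value to (absolute frequency, relative frequency)."""
    counts = Counter(data)
    n = len(data)
    return {value: (freq, freq / n) for value, freq in counts.items()}

# Made-up observations of a nominal variable.
colours = ["red", "blue", "red", "green", "red", "blue", "red", "green"]
dist = frequency_distribution(colours)
for value, (freq, rel) in dist.items():
    print(f"{value}: frequency {freq}, relative frequency {rel:.2f} ({rel * 100:.0f}%)")
```

The relative frequencies always add up to 1 (or 100%).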
Ordinal variables
- Cumulative frequency = the number of observations with a value smaller than or equal to a
given value; this requires values that can be put in increasing order
- Cumulative (relative) frequency distribution = overview of all different values
combined with the respective cumulative (relative) frequencies
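For an ordinal variable the cumulative (relative) frequency distribution can be built by accumulating the frequencies in the natural order of the values. In this sketch the clothing sizes, their order and the helper name cumulative_distribution are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import accumulate

def cumulative_distribution(data, order):
    """Rows of (value, frequency, cumulative frequency, cumulative relative frequency)."""
    counts = Counter(data)
    freqs = [counts[v] for v in order]   # frequencies in the natural order
    cum = list(accumulate(freqs))        # running totals
    n = len(data)
    return [(v, f, c, c / n) for v, f, c in zip(order, freqs, cum)]

# Made-up observations of an ordinal variable with natural order S < M < L.
sizes = ["S", "M", "L", "M", "S", "M", "L", "M", "S", "M"]
table = cumulative_distribution(sizes, order=["S", "M", "L"])
for row in table:
    print(row)
```

The last cumulative relative frequency is always 1, because every observation is smaller than or equal to the largest value.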
Quantitative variables
- Discrete variable = each different value forms a class if there are not too many
- Continuous variable = the classes are usually adjoining intervals
(Cumulative) distribution for a discrete variable
The cumulative distribution function (or distribution function for short) of a dataset of
observations of a discrete variable is the function F such that, for all real numbers b:
F(b) = relative frequency of the observations ≤ b
Properties of the distribution function for a discrete variable
- It is a non-decreasing step function
- It jumps to higher vertical levels at the different values of the dataset
- The jump sizes are just the relative frequencies of the different values in the dataset
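The definition of F and its step-function behaviour can be checked with a small sketch. The dataset (numbers of children per household) and the helper name distribution_function are made up for illustration.

```python
def distribution_function(data):
    """Return F with F(b) = relative frequency of the observations <= b."""
    n = len(data)

    def F(b):
        return sum(1 for x in data if x <= b) / n

    return F

# Made-up observations of a discrete variable.
children = [0, 1, 1, 2, 2, 2, 3, 4]
F = distribution_function(children)

print(F(-1))   # below the smallest observation
print(F(1))    # at a value of the dataset
print(F(1.5))  # between two values: same level as F(1)
print(F(4))    # at the largest observation
```

Between two neighbouring values of the dataset F stays flat, and at the value 2 it jumps by F(2) - F(1.5), which is the relative frequency of the value 2.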
Data of continuous variables
- Categorical system = a system of classes that have no values in common and together cover
the whole range
- Classification = the data recorded in a categorical system
- Classified frequency distribution = frequency distribution that gives an overview of the
chosen classification and the respective frequencies
Cumulative distribution function when the variable is continuous
In this case F(b) is the proportion of the observations that, according to the classified frequency
distribution, is smaller than or equal to b.
If b is the upper bound of a class in the classification, then F(b) is just the cumulative relative
frequency up to and including that class. If b is smaller than the lower bound of the first class
in the classification or larger than the upper bound of the last class, then F(b) equals 0 or 1,
respectively.
For b inside a class (l, u) of the classification with lower bound l and upper bound u, note
that F(u) is just the cumulative relative frequency up to and including that class, while F(l) is
the cumulative relative frequency up to and including the preceding class in the classification.
Hence, F(u) - F(l) is the relative frequency of the class (l, u). The value F(b) for b in (l, u)
follows by plotting the pairs (l, F(l)) and (u, F(u)) as dots in a two-dimensional system of axes,
connecting them with a straight line and reading F(b) off this line:
F(b) = F(l) + (F(u) - F(l)) (b - l) / (u - l)
Figure: linear interpolation
(Cumulative) distribution function of a classified frequency distribution
The (cumulative) distribution function F of a classified frequency distribution is the function
that arises from the cumulative relative frequencies of the lower and upper bounds of the classes
in the classification by using the method of linear interpolation.
Property: F is not a step function but a continuous and non-decreasing function that on each
class (l, u) of the classification goes from F(l) to F(u) by way of a straight line.
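The linear interpolation method can be sketched as follows. The class bounds, the class frequencies and the helper name classified_distribution_function are assumptions made for the example; the classes are taken to be adjoining intervals given by an increasing list of bounds.

```python
def classified_distribution_function(bounds, frequencies):
    """Distribution function of a classified frequency distribution.

    bounds: increasing class boundaries l0 < l1 < ... < lk (adjoining classes)
    frequencies: the k class frequencies
    """
    if len(bounds) != len(frequencies) + 1:
        raise ValueError("need exactly one more bound than classes")
    n = sum(frequencies)
    cum = [0.0]  # cumulative relative frequency at each bound
    running = 0
    for f in frequencies:
        running += f
        cum.append(running / n)

    def F(b):
        if b <= bounds[0]:
            return 0.0  # below the first class
        if b >= bounds[-1]:
            return 1.0  # above the last class
        for i in range(len(frequencies)):
            l, u = bounds[i], bounds[i + 1]
            if l <= b <= u:
                # straight line from (l, F(l)) to (u, F(u))
                return cum[i] + (cum[i + 1] - cum[i]) * (b - l) / (u - l)

    return F

# Made-up classification: classes 0-10, 10-20, 20-40 with frequencies 5, 10, 5.
F = classified_distribution_function([0, 10, 20, 40], [5, 10, 5])
print(F(10), F(15), F(30))
```

At a class bound F equals the cumulative relative frequency, and halfway through a class it lies halfway between the values at the two bounds.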
Time series data
- Cross-sectional data = measurements made at one moment in time; most datasets so far
have been of this kind
- Time series data = measurements of a single variable at successive periods or moments
in time. The sequence of successive data is called a time series.
Chapter 3: Measures of location
Statistics such as the mean value and the percentage overweight are of interest because they help
summarize the dataset. Location refers to some central position of the dataset and its distribution,
but is not yet defined precisely. Examples of measures of location are the mode, median and mean.
Nominal variables
- Mode / modal value = the value within the dataset that has the highest frequency
o Unimodal = if the dataset has one modal value
o Bimodal = if the dataset has two modal values
o Multimodal = if the dataset has more than two modal values
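Finding the modal value(s) amounts to looking up the highest frequency. The sketch below returns all modal values so that bimodal and multimodal datasets are handled too; the data and the helper name modes are illustrative assumptions.

```python
from collections import Counter

def modes(data):
    """Return every value that has the highest frequency in the dataset."""
    counts = Counter(data)
    top = max(counts.values())
    return [value for value, freq in counts.items() if freq == top]

# Made-up observations of a nominal variable.
transport = ["bike", "car", "bike", "train", "car"]
print(modes(transport))  # two modal values, so this dataset is bimodal
```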
Ordinal variables
- Median = if the number of data points is odd, the value of the middlemost observation
of the ordered data. If the number of data points is even, only the middlemost pair of the
ordered data can be determined.
o If the dataset is a population dataset, its median is denoted by µmedian
o If the dataset is a sample dataset, its median is denoted by χmedian
Quantitative variables
- Median = if the number of data points is even and the variable is quantitative, the mean of
the middlemost pair Xm1 and Xm2 of the ordered data: (Xm1 + Xm2) / 2.
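For a quantitative variable, the two cases (odd and even number of data points) can be sketched as follows. The helper name median and the example data are assumptions for illustration only.

```python
def median(data):
    """Median of quantitative data: middle value, or mean of the middle pair."""
    if not data:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd: middlemost observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: (Xm1 + Xm2) / 2

print(median([7, 1, 3]))     # odd number of data points
print(median([4, 1, 3, 2]))  # even number of data points
```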
Arithmetic mean
Proportion of successes
For qualitative variables the mean of a dataset is not defined, but for a quantitative variable it is
defined. It can be shown that the mean of observations of a 0-1 variable does make sense and is
equal to the proportion of ones in the dataset.
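That the mean of a 0-1 variable equals the proportion of ones can be verified numerically. The observations below are made up, and mean is simply the ordinary arithmetic mean (sum divided by number of observations).

```python
def mean(data):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(data) / len(data)

# Made-up observations of a 0-1 variable (1 = success, 0 = failure).
successes = [1, 0, 1, 1, 0, 0, 1, 1]
proportion = successes.count(1) / len(successes)

print(mean(successes), proportion)
```

Summing a 0-1 variable just counts the ones, so dividing by the number of observations gives the proportion of successes.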
Weighted mean
Geometric mean
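As a hedged illustration of the last two headings, the sketch below assumes the standard textbook definitions: the weighted mean is the sum of weight times value divided by the sum of the weights, and the geometric mean of n positive values is the n-th root of their product. The helper names and the example numbers are assumptions.

```python
import math

def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights (assumed definition)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    if any(x <= 0 for x in values):
        raise ValueError("the geometric mean needs positive values")
    return math.exp(sum(math.log(x) for x in values) / len(values))

# Made-up example: grade 6 with weight 1 and grade 8 with weight 3.
print(weighted_mean([6, 8], [1, 3]))
# Made-up example: geometric mean of 2 and 8.
print(geometric_mean([2, 8]))
```

Working with logarithms avoids very large intermediate products; the result can differ from the exact root by a small floating-point rounding error.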