Statistics for Pre-MSc 2024-2025
(EBS027A05)
The purpose of this document is to help you study for the exam alongside the knowledge clips,
not to replace them.
Contents
Knowledge clips
  Clip 1 (The role of statistics and data)
  Clip 2 (Descriptive Statistics: Tables and Figures)
  Clip 3 (Descriptive Statistics: numerical measures)
  Clip 4 (Descriptive Statistics: Relations between two variables)
  Clip 5 (Inferential Statistics Probabilities and Distribution)
    5A. Probabilities
    5B. Distribution
  Clip 6 (Inferential Statistics)
    6A. Estimation
    6B. Hypothesis testing
SPSS data interpretation
  Descriptive Statistics Output
  Chi-Square tests
  Pearson correlation coefficient
  Independent sample T-Test
  ANOVA (Analysis of Variance)
  Regression analysis
List of equations to know before the exam from BS
List of the actual equations to know
List of equations with examples:
  1. Basic Statistics (Mean, Median, Mode)
  2. Interquartile Range (IQR)
  3. Standard Deviation and Variance
  4. Probability and Events
  5. Binomial Distribution
  6. Uniform Distribution
  7. Standardization (Z-Score)
  8. Sampling and Confidence Intervals
  9. Hypothesis Testing
Course Learning goals
1. Translate a stylized problem into a statistical problem
2. Solve the translated problem by applying the appropriate theory
3. Compare and contrast samples from different populations, using parametric or
non-parametric tests
4. Test if data meet the required assumptions of the statistical models used
5. Use SPSS for descriptions and inference
In both applied and theoretical business research, statistical methods and techniques play an
important role. The primary objective of the course is to lay the foundation for correct and relevant
use of statistical methods. In order to achieve this, attention will be paid to (1) the technical skills
required for data analysis, and (2) the assessment and interpretation of results from statistical
analysis.
Knowledge clips
Clip 1 (The role of statistics and data)
“Statistics is the art and science of collecting, analyzing, presenting and interpreting data”
Important terminology in the statistics course:
• Database or data set: the complete collection of information (values) in a sheet
• Columns: variables (the vertical axis)
• Rows: observations or cases (the horizontal axis)
• Each individual cell: a measurement or a data point
There are four types of variables:
1. Nominal variables are the simplest form of data and are used for labelling variables without
any quantitative value. They categorize data without a natural order or ranking.
• Example: Gender (male, female), types of cuisine (Italian, Chinese, Mexican).
2. Ordinal variables represent categories with a meaningful order or ranking, but the intervals
between the ranks are not necessarily equal.
• Example: Education level (high school, bachelor’s, master’s, doctorate), customer
satisfaction ratings (satisfied, neutral, dissatisfied).
3. Interval variables are numerical scales where intervals between values are meaningful and
consistent, but there is no true zero point.
• Example: Temperature in Celsius or Fahrenheit. The difference between 20°C and 30°C is
the same as between 30°C and 40°C, but 0°C does not represent the absence of
temperature.
4. Ratio variables are similar to interval variables, but they have a meaningful zero point,
indicating the absence of the variable being measured, allowing for the calculation of ratios.
• Example: Height, weight, and age. For instance, a weight of 0 kg means no weight, and a
person who is 50 kg is twice as heavy as someone who is 25 kg.
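The difference between an interval and a ratio scale can be checked with a little arithmetic. The
sketch below is only an illustration (the helper functions and the chosen values are assumptions,
not part of the course material): a ratio between two temperatures changes when the unit changes,
while a ratio between two weights does not.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def kg_to_pounds(kg):
    """Convert a weight from kilograms to pounds."""
    return kg * 2.20462


# Interval scale: 0 is an arbitrary point, so ratios depend on the unit used.
c_low, c_high = 20.0, 40.0
ratio_celsius = c_high / c_low
ratio_fahrenheit = celsius_to_fahrenheit(c_high) / celsius_to_fahrenheit(c_low)

# Ratio scale: 0 means "no weight", so ratios are the same in every unit.
w_light, w_heavy = 25.0, 50.0
ratio_kg = w_heavy / w_light
ratio_lb = kg_to_pounds(w_heavy) / kg_to_pounds(w_light)

print("Temperature ratio:", ratio_celsius, "in Celsius,",
      round(ratio_fahrenheit, 2), "in Fahrenheit")
print("Weight ratio:", ratio_kg, "in kg,", round(ratio_lb, 2), "in pounds")
```

Because the temperature ratio is 2 in Celsius but about 1.53 in Fahrenheit, saying that 40°C is
"twice as hot" as 20°C has no meaning. The weight ratio is 2 in both units.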
The 3 types of datasets:
1. Cross-sectional data refers to data collected at a single point in time or over a very short
period. Each observation represents a different individual, entity, or unit, but all are observed
simultaneously or within the same period.
• Example: Imagine you survey 1,000 households to collect information on their income,
education level, and household size. If all the data are collected at the same time, this is
cross-sectional data.
• Usage: Cross-sectional data is often used in regression analysis to identify relationships
between variables at a particular point in time. It’s useful for analyzing and comparing
different groups or entities, such as comparing income levels across different regions.
2. Time series data consists of observations collected sequentially over time, usually at
consistent intervals (e.g., daily, monthly, quarterly). Each observation represents the value of
a variable at a specific time point.
• Example: If you track the daily closing price of a stock over a year, you would have a time
series dataset, where each observation corresponds to the stock price at the end of each
trading day.
• Usage: Time series data is primarily used to identify trends, seasonal patterns, and cycles
over time. Analysts often use this data for forecasting, such as predicting future stock
prices, economic indicators, or weather patterns.
3. Panel data is a combination of cross-sectional and time series data. It involves collecting data
on the same subjects (individuals, firms, countries, etc.) at multiple points in time.
• Example: Suppose you survey the same 1,000 households every year for five years,
collecting information on their income, education level, and household size each time. The
resulting dataset would be panel data because it includes multiple observations over time
for each household.
• Usage: Panel data allows for more complex analyses, such as examining how variables
change over time within the same entity and how these changes differ across entities. It’s
particularly useful for controlling for unobserved heterogeneity, as it accounts for
individual-specific effects that are constant over time.
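The three dataset types differ in how many units and how many time points they contain. The
following minimal sketch uses invented numbers and a hypothetical helper called classify (neither
comes from the course) to show the shape of each type.

```python
# Cross-sectional: several households, all observed at the same moment.
cross_section = [
    {"household": "A", "income": 42000},
    {"household": "B", "income": 55000},
    {"household": "C", "income": 38000},
]

# Time series: one variable observed at consecutive time points.
time_series = [
    {"day": 1, "closing_price": 10.2},
    {"day": 2, "closing_price": 10.5},
    {"day": 3, "closing_price": 10.1},
]

# Panel: the same households observed in several years.
panel = [
    {"household": "A", "year": 2021, "income": 40000},
    {"household": "A", "year": 2022, "income": 42000},
    {"household": "B", "year": 2021, "income": 53000},
    {"household": "B", "year": 2022, "income": 55000},
]


def classify(rows, unit_key=None, time_key=None):
    """Label a dataset by counting its distinct units and distinct time points."""
    n_units = len({row[unit_key] for row in rows}) if unit_key else 1
    n_times = len({row[time_key] for row in rows}) if time_key else 1
    if n_units > 1 and n_times > 1:
        return "panel"
    if n_times > 1:
        return "time series"
    return "cross-sectional"


print(classify(cross_section, unit_key="household"))
print(classify(time_series, time_key="day"))
print(classify(panel, unit_key="household", time_key="year"))
```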
Sources of data:
Primary data refers to data that is collected firsthand by the researcher for a specific purpose or
research question. This data is original and unique to the study at hand, meaning it has not been
previously collected or published.
Secondary data refers to data that has already been collected, processed, and possibly analyzed by
someone else, often for a different purpose. Researchers use secondary data to support or
supplement their research without collecting new data.
Key statistical concepts
A population is the group of all items/cases of interest. If you want to draw a conclusion about this
group but are unable to study the whole population, you have to take a sample. A sample is a
group of items/cases drawn from a population. To reach the conclusion, you apply statistical analysis
to the data from the sample. In general, the larger the sample, the more accurate the conclusion.
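The link between sample size and accuracy can be illustrated with a small simulation. This is a
sketch under stated assumptions: the population is invented (normally distributed values with mean
50 and standard deviation 10), and the sample sizes and number of repeats are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is repeatable

# Hypothetical population of 20,000 values (an assumption for illustration).
population = [random.gauss(50, 10) for _ in range(20_000)]
population_mean = statistics.fmean(population)


def average_error(sample_size, repeats=200):
    """Average distance between the sample mean and the population mean."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - population_mean))
    return statistics.fmean(errors)


error_small = average_error(10)
error_large = average_error(1000)

print("Average error with n = 10:  ", round(error_small, 3))
print("Average error with n = 1000:", round(error_large, 3))
```

On average, the sample mean from the larger samples lies much closer to the population mean than
the sample mean from the smaller samples.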
Two types of statistical analysis:
Descriptive statistics (knowledge clips 2 & 3) are methods for summarizing and organizing
the information in a data set. They provide simple summaries of the sample and its measures.
The key point is that descriptive statistics describe and summarize the data without
making any inferences or predictions about a larger population.
In contrast, inferential statistics (knowledge clips 4 – 9) go beyond simply describing the data. They
are used to make inferences or generalizations about a population based on a sample of data drawn
from that population. The goal is to make predictions or decisions about the population while
accounting for the randomness and uncertainty inherent in sampling.
In summary, while descriptive statistics summarize the characteristics of a data set, inferential
statistics use sample data to make generalizations about a larger population.
Clip 2 (Descriptive Statistics: Tables and Figures)
Categorical variables
Categorical variables represent distinct categories or groups within a dataset, such as gender, types of
products, or regions. These variables don't possess a meaningful numeric scale but rather define
classifications. The appropriate presentation of categorical data involves choosing between tabular
and graphical formats depending on the clarity and purpose of the data display.
Tables
• Frequency distribution: Categorical data is often summarized by showing how frequently each
category appears, which is termed a frequency distribution. This approach helps readers
understand the distribution of categories within the dataset.
• Shown in a table: Frequency distributions are commonly presented in tables where each category
is displayed alongside its frequency count or percentage.
• Absolute & relative frequency (percentages): Tables may show both absolute frequencies
(counts) and relative frequencies (percentages), allowing for easy comparison between
categories.
• Cumulative frequencies: For categorical data where categories are ordered (like age groups),
cumulative frequencies may also be presented to show the cumulative totals up to each category.
• Grouping smaller categories: To maintain clarity, it’s advised not to create too many categories.
If certain categories have very few entries, they can be combined under an "Other" category,
making the table more readable.
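A frequency table of this kind can be sketched in a few lines of Python. The survey answers and
the cut-off for the "Other" category (fewer than 2 entries) are invented for illustration. In the
course itself these tables are produced with SPSS.

```python
from collections import Counter

# Hypothetical survey answers: favourite type of cuisine.
answers = ["Italian", "Chinese", "Italian", "Mexican", "Italian",
           "Chinese", "Thai", "Italian", "Greek", "Chinese"]

counts = Counter(answers)  # absolute frequency per category
n = len(answers)

# Combine categories with very few entries under "Other".
min_count = 2  # assumed cut-off, chosen for this example
table = {}
for category, count in counts.most_common():
    label = category if count >= min_count else "Other"
    table[label] = table.get(label, 0) + count

# Print absolute and relative frequencies side by side.
print(f"{'Category':<10}{'Count':>6}{'Percent':>9}")
for label, count in table.items():
    print(f"{label:<10}{count:>6}{100 * count / n:>8.1f}%")
```

Here Mexican, Thai and Greek each occur once, so they are merged into one "Other" row with a
count of 3 (30%).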
Figures
• Bar charts: Bar charts are often used to visually display categorical data. Each bar represents a
category, and the height corresponds to the frequency or percentage of that category. Bar charts
can be oriented horizontally or vertically, depending on space and readability.
• Frequency Labels: Bar charts can be presented with or without exact numerical labels, such as
frequencies or percentages, depending on the level of detail required.
• Pie Charts: Pie charts are sometimes used to represent categorical data but are often discouraged
due to difficulties in interpreting precise proportions visually. They are useful mainly when the
distribution of categories as a percentage of the whole is of primary interest.
In a report you should always strike a balance between tables and figures. For simplicity and
consistency, it is recommended to use only one type of display, either a table or a figure, for
presenting the same data, as multiple formats can be redundant and potentially confusing.
Numeric variables
Numeric variables, also known as quantitative variables, represent data with measurable quantities
and can be further divided into continuous or discrete types. Here is a detailed explanation of the key
points for representing numeric variables:
• Frequency distribution: This shows how often each value or group of values occurs within the
dataset. For numeric data, frequencies are often grouped into ranges or intervals to create a
summary that’s easier to interpret, especially for large datasets.
• Absolute & relative frequency (percentages):
o Absolute frequency refers to the count of occurrences within each interval.
o Relative frequency represents the proportion of occurrences in each interval, often
expressed as a percentage. Using percentages allows for easier comparison, especially
across different sample sizes.
• Cumulative frequencies (more relevant for numeric data than for categorical data):
o Cumulative frequency adds up the frequencies from each interval, giving a running total.
This measure is particularly relevant for numeric data because it shows the number of
observations up to a particular value or threshold.
o Cumulative percentage shows the proportion of data points that fall below a certain
threshold, making it easier to understand distributions at a glance (e.g., 35.8% of people
are below 30 years).
• Recoding and Grouping:
o For datasets with a wide range of numeric values or too many categories, it’s useful to
combine values into groups (e.g., age groups like 20-29, 30-39) to make the data easier to
analyze and interpret.
o It’s important to create groups of equal width wherever possible (e.g., 10-year age
intervals), as this consistency simplifies comparison and interpretation.
• Equal Distance Between Thresholds: When creating intervals for numeric data (e.g., ages 10-20,
20-30), maintaining an equal distance between thresholds helps in interpreting the data and
creating consistent visualizations like histograms.
• Cumulative Percentage with Combined Groups: Cumulative percentages become meaningful
when groups are combined systematically, allowing readers to understand how much of the data
lies below a certain threshold or within certain ranges.
• Histogram Representation:
o Histograms are ideal for displaying numeric data as they show the frequency of data
within each interval in a visually intuitive way.
o Numeric data in histograms are arranged in ascending or descending order, often from
low to high, to represent the progression of the variable.
o Each bar in a histogram corresponds to an interval, and the height of the bar indicates the
frequency or count within that interval.
• Including Percentages in Bars: Adding percentages inside histogram bars can help readers quickly
understand the relative frequency of each interval, making the visualization more informative and
easier to interpret.
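The grouping steps above can be sketched as follows. The ages are invented, and the thresholds
(10-year intervals from 20 to 69) are an assumption for this example. The row of # characters is
a crude text version of a histogram bar.

```python
# Hypothetical ages of 20 respondents.
ages = [21, 24, 27, 29, 31, 33, 35, 38, 42, 45,
        47, 51, 53, 58, 62, 25, 36, 44, 39, 28]

width = 10   # equal distance between thresholds
lower = 20   # lowest threshold
upper = 70   # upper limit (not included)

# Absolute frequency per interval, e.g. 20-29 holds every age from 20 up to 29.
bins = {}
for start in range(lower, upper, width):
    bins[(start, start + width - 1)] = sum(start <= a < start + width for a in ages)

# Relative and cumulative percentages, built up interval by interval.
n = len(ages)
cumulative = 0
rows = []
for (low, high), count in bins.items():
    cumulative += count
    rows.append((f"{low}-{high}", count, 100 * count / n, 100 * cumulative / n))

print(f"{'Group':<7}{'n':>3} {'Pct':>7} {'Cum.':>7}")
for label, count, pct, cum_pct in rows:
    print(f"{label:<7}{count:>3} {pct:>6.1f}% {cum_pct:>6.1f}%  {'#' * count}")
```

The cumulative column can be read directly: in this invented sample, 60% of respondents are
younger than 40.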
Clip 3 (Descriptive Statistics: numerical measures)
Recommendation: start the data analysis of each project with descriptive statistics, to get an
overview of the data and start understanding what the data means.
Properties of distributions 1:
A population is the underlying group of people you want to study in order to make statements
about it. If you study the whole population, you end up with a probability distribution, as shown
in the graph to the left.
If you study a sample of a population, you end up with a histogram of the frequency distribution,
as shown in the graphs to the right.
A distribution has several key characteristics. One of them is the measure of central tendency, a
statistical concept that represents the center or typical value of a dataset: a single value that
summarizes the distribution of the data. The three main measures of central tendency are:
1. Mode: The mode is the value that appears most frequently in a dataset. It represents the most
common or repeated value. A dataset can have one mode, more than one mode, or no mode at
all if all values occur with the same frequency.
a. The mode is useful for categorical data (like favorite colors or types of cuisine) where we
want to identify the most popular or common category.
2. Median: The median is the middle value in a dataset when all values are arranged in ascending or
descending order. It represents the "center" point of the data. If there is an even number of data
points, the median is the average of the two middle values.
a. The median is especially useful for skewed distributions (where data are not symmetrically
distributed) because it is not affected by extreme values (outliers) as much as the mean
is. For example, in income data where a few high salaries could skew the mean, the
median provides a more representative central value.
3. Mean (x̄): The mean, often represented as x̄, is the average of all values in a dataset. It is calculated
by adding up all the values and dividing by the total number of values:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
a. The mean is widely used in statistical analysis because it takes into account all values in
the dataset. However, it is sensitive to extreme values or outliers, which can distort the
average. For instance, in a dataset of salaries, one exceptionally high salary can increase
the mean significantly, making it less representative of the typical salary.
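The three measures, and the effect of an outlier on each, can be compared with Python's standard
statistics module. The salary figures below are invented for illustration, and summarize is a
hypothetical helper, not something from the course.

```python
import statistics

# Hypothetical monthly salaries, without and with one extreme value.
salaries = [2800, 3000, 3000, 3200, 3500]
salaries_with_outlier = salaries + [25000]


def summarize(data):
    """Return the three measures of central tendency for a list of numbers."""
    return {
        "mode": statistics.mode(data),      # most frequent value
        "median": statistics.median(data),  # middle value of the sorted data
        "mean": statistics.mean(data),      # sum of the values divided by n
    }


summary = summarize(salaries)
summary_outlier = summarize(salaries_with_outlier)

print("Without outlier:", summary)
print("With outlier:   ", summary_outlier)
```

Adding the single salary of 25000 moves the mean from 3100 to 6750, while the median only moves
from 3000 to 3100 and the mode stays at 3000. For data with several equally frequent values,
statistics.multimode returns all of them.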