Summary

Statistics summary Pre-MSc


Institution
Rijksuniversiteit Groningen (RuG)

Useful statistics summary for the exam of Business Research Methods for Pre-MSc. Formulas included! With references to which appendix you need from the book for some calculations.

Preview: 8 of 53 pages

Whole book summarized? No
Which parts of the book are summarized? Chapters 1 to 13, chapters 15 to 17, and chapter 19 (see the table of contents for the specific paragraphs)
Uploaded on 14 December 2020
Number of pages: 53
Written in 2020/2021
Type: Summary

statistics
nominal data
ordinal data
interval data
probability
variables
normal distribution
sampling distribution
mean
median
mode
hypothesis testing
null hypothesis
graphical descriptive techniques


Contents
C1 What is statistics?
1.1 Key statistical concepts
1.2 Statistical applications in business
1.3 Large real data sets
C2 Graphical descriptive techniques I
2.1 Types of data and information
2.2 Describing a set of nominal data
2.3 Describing the relationship between two nominal variables and comparing two or more nominal data sets
C3 Graphical descriptive techniques II
3.1 Graphical techniques to describe a set of interval data
3.2 Describing time-series data
3.3 Describing the relationship between two interval variables
C4 Numerical descriptive techniques
4.1 Measures of central location
4.2 Measures of variability
4.3 Measures of relative standing and box plots
4.7 Comparing graphical and numerical techniques
4.8 General guidelines for exploring data
C6 Probability
6.1 Assigning probability to events
6.2 Joint, marginal, and conditional probability
6.3 Probability rules and trees
C7 Random variables and discrete probability distributions
7.1 Random variables and probability distributions
7.4 Binomial distribution
C8 Continuous probability distributions
8.1 Probability density functions
8.2 Normal distribution
8.4 Other continuous distributions
C9 Sampling distributions
9.1 Sampling distribution of the mean
9.2 Sampling distribution of a proportion
9.3 Sampling distribution of the difference between two means
9.4 From here to inference
C10 Introduction to estimation
10.1 Concepts of estimation
10.2 Estimating the population mean when the population standard deviation is known
10.3 Selecting the sample size
C11 Introduction to hypothesis testing
11.1 Concepts of hypothesis testing
11.2 Testing the population mean when the population standard deviation is known
11.4 The road ahead
C12 Inference about a population
12.1 Inference about a population mean when the standard deviation is unknown
12.2 Inference about a population variance
12.3 Inference about a population proportion
C13 Inference about comparing two populations
13.1 Inference about the difference between two means: independent samples
13.2 Observational and experimental data
13.3 Inference about the difference between two means: matched pairs experiment
13.4 Inference about the ratio of two variances
13.5 Inference about the difference between two population proportions
C15 Chi-squared tests
15.1 Chi-squared goodness-of-fit test
15.2 Chi-squared test of a contingency table
15.3 Summary of tests on nominal data
C16 Simple linear regression and correlation
16.1 Model
16.2 Estimating the coefficients
16.3 Error variable: required conditions
16.4 Assessing the model
16.5 Using the regression equation
C17 Multiple regression
17.1 Model and required conditions
17.2 Estimating the coefficients and assessing the model
C19 Nonparametric statistics
19.1 Wilcoxon Rank Sum Test
19.2 Sign test and Wilcoxon Signed Rank Sum Test
19.4 Spearman Rank Correlation Coefficient


C1 What is statistics?
Statistics is a way to get information from data.

Descriptive statistics deal with methods of organizing, summarizing, and presenting data in a
convenient and informative way. One form of descriptive statistics uses graphical techniques that
allow statistics practitioners to present data in ways that make it easy for the reader to extract useful
information. Another form of descriptive statistics uses numerical techniques to summarize data. The
actual technique we use depends on what specific information we would like to extract.

Inferential statistics is a body of methods used to draw conclusions or inferences about
characteristics of populations based on sample data.

When an election for an important office such as president or senator in a large state takes place, the
television networks actively compete to see which one will be the first to predict a winner. This is done
through exit polls, in which a random sample of voters who exit the polling booth are asked for whom they
voted.

1.1 Key statistical concepts
Statistical inference problems involve three key concepts:
1. Population
A population is the group of all items of interest to a statistics practitioner. It does not
necessarily refer to a group of people.
A descriptive measure of a population is called a parameter. In most applications of
inferential statistics the parameter represents the information we need.
2. Sample
A sample is a set of data drawn from the studied population.
A descriptive measure of a sample is called a statistic. We use statistics to make inferences
about parameters.
3. Statistical inference
Statistical inference is the process of making an estimate, prediction, or decision about a
population based on sample data. However, such conclusions and estimates are not always
going to be correct. For this reason, we build into the statistical inference a measure of
reliability. There are two such measures:
a. The confidence level: the proportion of times that an estimating procedure will be
correct.
b. The significance level: measures how frequently the conclusion will be wrong.

1.2 Statistical applications in business
To provide sufficient background to understand the statistical applications, we introduce applications
in accounting, economics, finance, human resources management, marketing, and operations
management. We will provide readers with some background to these applications by describing
their functions in two ways:

Application Sections and Subsections
We feature five sections that describe statistical applications in the functional areas of business. One
section and one subsection demonstrate the uses of probability and statistics in specific industries.

Application Boxes
For other topics that require less detailed description, we provide application boxes with a relatively
brief description of the background followed by examples or exercises.


1.3 Large real data sets
The authors believe that you learn statistics by doing statistics. To provide practice we have created
six large real datasets, available to be downloaded from Keller’s website. Their sources include the:
➢ General Social Survey: has been tracking American attitudes on a wide variety of topics
➢ American National Election Survey: provides data about why Americans vote as they do

C2 Graphical descriptive techniques I
Before the data can be used to support a decision, they must be organized and summarized.
Although descriptive statistical methods are quite straightforward, their importance should not be
underestimated.

The most important factors that determine the appropriate method to use are (1) the type of data
and (2) the information that is needed.

2.1 Types of data and information
➢ A variable is some characteristic of a population or sample. We usually represent the name
of the variable using uppercase letters such as X, Y and Z. (The mark on a statistics exam for
example)
➢ The values of the variables are the possible observations of the variable. (The values of
statistics exam marks are the integers between 0 and 100, assuming the exam is marked out
of 100)
➢ Data are the observed values of a variable. Data is the plural of datum; the mark of one student
is a datum. There are three types of data: interval, nominal, and ordinal.
o Interval data are real numbers, such as heights, weights, incomes, and distances. We
also refer to this type of data as quantitative or numerical. The intervals or
differences between values of interval data are consistent and meaningful.
o The values of nominal data are categories. The values are not numbers but instead
are words that describe the categories. Nominal data are also called qualitative or
categorical.
o Ordinal data appear to be nominal, but the difference is that the order of their
values has meaning. Because the codes representing ordinal data are arbitrarily
assigned except for the order, we cannot calculate and interpret differences.

Interval:
- values are real numbers;
- all calculations are valid;
- data may be treated as ordinal or nominal.
Ordinal:
- values must represent the ranked order of the data;
- calculations based on an ordering process are valid;
- data may be treated as nominal but not as interval.
Nominal:
- values are arbitrary numbers that represent categories;
- only calculations based on the frequencies or percentages of occurrence are valid;
- data may not be treated as ordinal or interval.


2.2 Describing a set of nominal data
The only allowable calculation on nominal data is to count the frequency or compute the percentage
that each value of the variable represents. We can summarize the data in a table, which presents the
categories and their counts, called a frequency distribution. A relative frequency distribution lists
the categories and the proportion with which each occurs (%). We can use graphical techniques to
present a picture of the data. There are two graphical methods we can use: the bar chart and the pie
chart.
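The counting described above can be sketched in a few lines of Python. This is only an illustration: the category labels and the variable names (`observations`, `frequency`, `relative_frequency`) are made up for the example and do not come from the book.

```python
from collections import Counter

# Hypothetical nominal data: one category label per observation.
observations = ["Finance", "Marketing", "Finance", "Accounting",
                "Marketing", "Finance", "Accounting", "Finance"]

# Frequency distribution: how many times each category occurs.
frequency = Counter(observations)

# Relative frequency distribution: each count as a proportion of the total.
total = len(observations)
relative_frequency = {category: count / total
                      for category, count in frequency.items()}

# Print category, frequency, and relative frequency as a percentage.
for category, count in frequency.most_common():
    print(f"{category:<12}{count:>3}{relative_frequency[category]:>8.1%}")
```

The two dictionaries hold exactly the information a bar chart (frequencies) and a pie chart (relative frequencies) would display.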

To extract useful information requires the application of a statistical or graphical technique. To
choose the appropriate technique we must first identify the type of data (nominal, ordinal, interval).

A bar chart is often used to display frequencies.
The bar chart is created by drawing a rectangle representing each category. The height of the
rectangle represents the frequency. The base is arbitrary.

A pie chart graphically shows relative frequencies.
A pie chart is simply a circle subdivided into slices that represent the categories. Because the entire
circle is composed of 360 degrees, a category that contains 25% of the observations is represented
by a slice of the pie that contains 25% of 360 degrees, which is equal to 90 degrees.
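The slice arithmetic can be checked with a short sketch. The function name `slice_angles` and the three categories are assumptions made for the example.

```python
def slice_angles(relative_frequencies):
    """Convert relative frequencies (proportions that sum to 1) to degrees."""
    return {category: proportion * 360
            for category, proportion in relative_frequencies.items()}

# Hypothetical categories with their relative frequencies.
shares = {"A": 0.25, "B": 0.50, "C": 0.25}
angles = slice_angles(shares)
print(angles)
```

A category holding 25% of the observations gets 0.25 × 360 = 90 degrees, and the slices together always fill the full 360 degrees.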

Factors that identify when to use frequency and relative frequency tables, bar charts, and pie charts:
1. Objective: describe a single set of data
2. Data type: nominal or ordinal

2.3 Describing the relationship between two nominal variables and comparing two or
more nominal data sets
Techniques applied to single sets of data are called univariate. There are many situations where we
wish to depict the relationship between variables; in such cases, bivariate methods are required. A
cross-classification table (also called a cross-tabulation table) is used to describe the relationship
between two nominal variables.

There are several ways to store the data to be used in this section to produce a table or pie chart:
1. The data are in two columns. The first column represents the categories of the first nominal
variable, and the second column stores the categories for the second variable. Each row
represents one observation of the two variables. The number of observations in each column
must be the same.
2. The data are stored in two or more columns, with each column representing the same
variable in a different sample or population.
3. The table representing counts in a cross-classification table may have already been created.
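As a rough sketch of how a cross-classification table follows from the first storage layout (two columns, one row per observation), the counts can be built with Python's standard library. The occupations and newspapers below are invented example data, and the names `rows` and `cell_counts` are illustrative only.

```python
from collections import Counter

# Hypothetical data in the two-column layout: each tuple is one observation
# (category of the first variable, category of the second variable).
rows = [
    ("Blue collar", "Newspaper A"),
    ("White collar", "Newspaper B"),
    ("Blue collar", "Newspaper A"),
    ("Professional", "Newspaper B"),
    ("White collar", "Newspaper A"),
    ("Blue collar", "Newspaper B"),
]

# Count how often each combination of categories occurs.
cell_counts = Counter(rows)

row_labels = sorted({first for first, _ in rows})
column_labels = sorted({second for _, second in rows})

# Print the table: one row per category of the first variable,
# one column per category of the second variable.
print(f"{'':<14}" + "".join(f"{label:>13}" for label in column_labels))
for first in row_labels:
    cells = "".join(f"{cell_counts[(first, second)]:>13}"
                    for second in column_labels)
    print(f"{first:<14}{cells}")
```

Combinations that never occur show up as 0, because a `Counter` returns 0 for keys it has not seen.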

Factors that identify when to use a cross-classification table
1. Objective: describe the relationship between two variables and compare two or more sets of
data.
2. Data type: nominal


C3 Graphical descriptive techniques II
Chapter 2 introduced graphical techniques used to summarize and present nominal data. In this
chapter, we do the same for interval data.

3.1 Graphical techniques to describe a set of interval data
The most important of these graphical methods is the histogram. The histogram is not only a
powerful graphical technique used to summarize interval data but is also used to help explain an
important aspect of probability.

In the previous section a frequency distribution was created by counting the number of times each
category of the nominal variable occurred. We create a frequency distribution for interval data by
counting the number of observations that fall into each of a series of intervals, called classes, that
cover the complete range of observations. Notice that the intervals do not overlap, so there is no
uncertainty about which interval to assign to any observation. It is not essential to make the intervals
equally wide, but it makes the task of reading and interpreting the graph easier.

To create the frequency distribution manually, we count the number of observations that fall into
each interval. Although the frequency distribution provides information about how the numbers are
distributed, the information is more easily understood and imparted by drawing a picture or graph.
The graph is called a histogram. A histogram is created by drawing rectangles whose bases are the
intervals and whose heights are the frequencies.

Table 3.2 - Approximate number of classes in histograms
Number of observations    Number of classes
Less than 50              5-7
50-200                    7-9
200-500                   9-10
500-1,000                 10-11
1,000-5,000               11-13
5,000-50,000              13-17
More than 50,000          17-20

An alternative to the guidelines listed in Table 3.2 is to use Sturges’ formula, which recommends that
the number of class intervals be determined by the following:
Number of class intervals = 1 + 3.3 log10(n), where n is the number of observations

We determine the approximate width of the classes by subtracting the smallest observation from the
largest and dividing the difference by the number of classes:
Class width = (largest observation – smallest observation) / number of classes
➢ We often round the result to some convenient value. We then define our class limits by
selecting a lower limit for the first class from which all other limits are determined. The only
condition we apply is that the first class interval must contain the smallest observation.

Table 3.2 and Sturges’ formula are guidelines only. It is more important to choose classes that are
easy to interpret.
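The steps for choosing classes can be sketched as follows. The data are made up, the function names are illustrative, and two choices are assumptions of this sketch rather than rules from the book: the result of Sturges' formula is rounded to the nearest whole number, and each class includes its lower limit but not its upper limit (the last class also takes the largest value).

```python
import math

def sturges_classes(n):
    """Number of class intervals suggested by Sturges' formula (base-10 log)."""
    return round(1 + 3.3 * math.log10(n))

def class_width(data, number_of_classes):
    """Approximate class width, before rounding to a convenient value."""
    return (max(data) - min(data)) / number_of_classes

def frequency_distribution(data, lower_limit, width, number_of_classes):
    """Count observations per class.

    Class k covers lower_limit + k*width up to (not including)
    lower_limit + (k+1)*width. Assumes lower_limit <= min(data).
    """
    counts = [0] * number_of_classes
    for x in data:
        k = min(int((x - lower_limit) // width), number_of_classes - 1)
        counts[k] += 1
    return counts

# Hypothetical interval data (20 observations).
data = [12, 35, 47, 8, 23, 51, 29, 44, 17, 38,
        5, 56, 31, 26, 42, 19, 33, 48, 14, 27]

k = sturges_classes(len(data))      # suggested number of classes
raw_width = class_width(data, k)    # (56 - 5) / 5
width = 12                          # rounded up to a convenient value
counts = frequency_distribution(data, lower_limit=0, width=width,
                                number_of_classes=k)
print(k, raw_width, counts)
```

Starting the first class at 0 satisfies the only condition stated above: the first class interval contains the smallest observation.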

A histogram is said to be symmetric if, when we draw a vertical line down the center of the
histogram, the two sides are identical in shape and size. A skewed histogram is one with a long tail
extending to either the right or the left. The former is called positively skewed, and the latter is
called negatively skewed.


A mode is the observation that occurs with the greatest frequency. A modal class is the class with the
largest number of observations. A unimodal histogram is one with a single peak. A bimodal
histogram is one with two peaks, not necessarily equal in height. Bimodal histograms often indicate
that two different distributions are present. A special type of symmetric unimodal histogram is the
one that is bell shaped.

One of the drawbacks of the histogram is that we lose
potentially useful information by classifying the
observations. A statistician named John Tukey introduced
the stem-and-leaf display, which is a method that to some
extent overcomes this loss.
The first step in developing a stem-and-leaf display is
to split each observation into two parts, a stem and a leaf.
There are several ways of doing this. For example, the
number 12.3 can be split so that the stem is 12 and the leaf is
3. Another method can define the stem as 1 and the leaf as 2
(ignoring the 3).
The stem-and-leaf display is similar to a histogram turned on its side. The length of each line
represents the frequency in the class interval defined by the stems. The advantage of the stem-and-
leaf display over the histogram is that we can see the actual observations.
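A minimal sketch of a stem-and-leaf display, assuming non-negative whole numbers where the last digit is the leaf and the remaining digits are the stem. The exam marks and the function name `stem_and_leaf` are invented for the example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group the last digit (leaf) of each value under its stem (other digits)."""
    display = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        display[stem].append(leaf)
    return dict(display)

# Hypothetical exam marks.
marks = [52, 67, 71, 58, 63, 75, 69, 80, 61, 77, 55, 68]
display = stem_and_leaf(marks)

# One line per stem; the length of the line shows the class frequency.
for stem in sorted(display):
    print(f"{stem} | {' '.join(str(leaf) for leaf in display[stem])}")
```

Each printed line corresponds to one class interval (50-59, 60-69, and so on), yet every original observation can still be read back from its stem and leaf. Stems with no observations are simply not printed in this sketch.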

The frequency distribution lists the number of observations that fall into each class interval. We can
also create a relative frequency distribution by dividing the frequencies by the number of
observations. In some situations, we may wish to highlight
the proportion of observations that lie below each of the
class limits. In such cases, we create the cumulative relative
frequency distribution. Another way of presenting this
information is the ogive (see picture), which is a graphical
representation of the cumulative relative frequencies.
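The relative and cumulative relative frequency distributions can be sketched like this. The class limits and frequencies are hypothetical, and the variable names are chosen for the example only.

```python
from itertools import accumulate

# Hypothetical class upper limits and the frequency of each class.
upper_limits = [12, 24, 36, 48, 60]
frequencies = [2, 5, 6, 4, 3]

total = sum(frequencies)

# Relative frequency: each class frequency divided by the number of observations.
relative = [f / total for f in frequencies]

# Cumulative relative frequency: proportion of observations below each upper limit.
cumulative_counts = list(accumulate(frequencies))
cumulative_relative = [c / total for c in cumulative_counts]

for limit, rel, cum in zip(upper_limits, relative, cumulative_relative):
    print(f"below {limit:>2}: relative {rel:.2f}, cumulative {cum:.2f}")
```

Plotting the pairs of upper limit and cumulative relative frequency and joining the points gives the ogive; the last value is always 1 because every observation lies below the final upper limit.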

Factors that identify when to use a histogram, ogive, or stem-and-leaf display
1. Objective: describe a single set of data
2. Data type: interval

3.2 Describing time-series data
Besides classifying data by type, we can also classify them according to whether the observations are
measured at the same time (cross-sectional data) or whether they represent measurements at
successive points in time (time-series data).
- Consider a real estate consultant who feels that the selling price of a house is a function of its
size, age and lot size. To estimate the specific form of the function, she samples 100 homes
recently sold and records the price, size, age and lot size for each home. These data are
cross-sectional: they are all observations at the same point in time.
- The real estate consultant is also working on a separate project to forecast the monthly
housing starts in the north-eastern US over the next year. To do so, she collects the monthly
housing starts in this region for each of the past 5 years. These 60 values (housing starts)
represent time-series data because they are observations taken over time.

Time-series data are often graphically depicted on a line chart, which is a plot of the variable over
time. It is created by plotting the value of the variable on the vertical axis and the time periods on the
horizontal axis.


3.3 Describing the relationship between two interval variables
Statistics practitioners frequently need to know how two interval variables are related. The
technique to describe the relationship between such variables is called a scatter diagram.

As was the case with histograms, we frequently need to describe verbally how two variables are
related. The two most important characteristics are the strength and direction of the linear
relationship. To determine the strength of the linear relationship, we draw a straight line through the
points in such a way that the line represents the relationship. If most of the points fall close to the
line, we say there is a linear relationship. If most of the points appear to be scattered randomly with
only a semblance of a straight line, there is no, or at best, a weak linear relationship.

Statisticians have produced an objective way to draw the straight line. The method is called the least
squares method, and it will be presented in Chapter 4.

If one variable increases when the other does, we say that there is a positive linear relationship.
When the two variables tend to move in opposite directions, we describe the nature of their
association as a negative linear relationship.

It is important to understand that if two variables are linearly related it does not mean that one is
causing the other. In fact, a linear relationship on its own never allows us to conclude that one
variable causes the other. We can express this more eloquently as:
Correlation is not causation.

C4 Numerical descriptive techniques
In this chapter we introduce the numerical descriptive techniques that allow the statistics
practitioner to be more precise in describing various characteristics of a sample or population. For
each measurement, we describe how to calculate both the population parameter and the sample
statistic. However, the formulas describing the calculation of parameters are not practical and are
seldom used.

4.1 Measures of central location
There are three different measures that we use to describe the center of a set of data:
1. The first is the arithmetic mean (the average), which we’ll refer to as the mean. The mean is
computed by summing the observations and dividing by the number of observations.
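Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where N is the population size
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where n is the sample size
∑ = sum of → represents ‘summation’
$\sum_{i=1}^{n} x_i$ = sum of $x_i$ for i from 1 to n → represents ‘summation of n numbers’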
2. The second measure of central location is the median. The median is calculated by placing all
the observations in order (ascending or descending). The observation that falls in the middle
is the median. When the number of observations is even, the median is the average of the two
middle observations. The sample and population medians are computed in the same way.
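A small sketch of the two calculations, using made-up observations. The function names are illustrative, and the even-sized case uses the usual convention of averaging the two middle observations.

```python
def mean(observations):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(observations) / len(observations)

def median(observations):
    """Middle observation of the ordered data (average of the two middle
    observations when the number of observations is even)."""
    ordered = sorted(observations)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical samples: one with an even and one with an odd number of observations.
even_sample = [23, 4, 42, 15, 8, 16]
odd_sample = [23, 4, 15, 8, 16]

print(mean(even_sample), median(even_sample))
print(mean(odd_sample), median(odd_sample))
```

The same functions apply to a population; only the notation changes (μ and N instead of x̄ and n).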
