STATISTICS
Radboud University - Discovering Statistics using IBM SPSS Statistics (Andy Field, fifth edition) - December 2019
D. Folmer

Chapter 1: Why is my evil lecturer forcing me to learn statistics?

The research process
You begin with an observation that you want
to understand. From your initial observation
you consult relevant theories and generate
explanations (hypotheses) for those
observations, from which you can make
predictions. To test your predictions, you need
data. First you collect some relevant data and
then analyse it. The analysis of the data may
support your hypothesis, or generate a new
one, which in turn might lead you to revise the
theory.

Initial observation
The first step is to come up with a question
that needs an answer. Having made a casual observation about the world, you need to collect some
data to see whether this observation is true. To do this you need to define one or more variables
that quantify the thing you're trying to measure.


Generating and testing theories and hypotheses
The next logical thing to do is to explain these data. The first step is to look for relevant theories. A
theory is an explanation or set of principles that is well substantiated by repeated testing and
explains a broad phenomenon. You could also have a critical mass of evidence to support the idea.

A hypothesis is a proposed explanation for a fairly narrow phenomenon or set of observations. It is
not a guess, but an informed, theory-driven attempt to explain what has been observed. Both
theories and hypotheses seek to explain the world, but a theory explains a wide set of phenomena
with a small set of well-established principles, whereas a hypothesis typically seeks to explain a
narrower phenomenon and is, as yet, untested. Both theories and hypotheses exist in the conceptual
domain and you cannot observe them directly.

To test a hypothesis, you need to move from the conceptual domain into the observable domain. So,
we need to operationalize the hypothesis in a way that enables us to collect and analyse data that
have a bearing on the hypothesis. We do this using predictions. Predictions emerge from a
hypothesis (so they are not the same thing), transforming it from something unobservable into
something observable.

If the data contradict the hypothesis, this is known as falsification (the act of disproving a theory or
hypothesis).

Collecting data – measurement
Independent and dependent variables
To test a hypothesis we need to measure variables. Variables are things that can change or vary.
Most hypotheses can be expressed in terms of two variables: a proposed cause and a proposed
outcome. A variable that we think is a cause is known as an independent variable (or predictor
variable), because its value does not depend on any other variables.

A variable that we think is an effect or outcome is called a dependent variable (or outcome
variable), because its value depends on the cause.

Levels of measurement
The relationship between what is being measured and the numbers that represent what is being
measured is known as the level of measurement.

A categorical variable is made up of categories; like species and nationality.
- Binary variable (dichotomy): consists of only two categories, e.g. male vs. female
- Nominal variable: consists of categories with just a label, e.g. hair colour
- Ordinal variable: consists of categories that can be ranked, e.g. level of tolerance

A continuous variable is one that gives us a score for each person and can take on any value on the
measurement scale that we are using; like temperature and age.
- Interval variable: not only classifies and orders the measurements, but also specifies that
the distances between each interval on the scale are equivalent along the whole scale, from
low to high. There is no absolute zero. Example: temperature.
- Ratio variable: the measurement is the estimation of the ratio between a magnitude of a
continuous quantity and a unit magnitude of the same kind. There is an absolute zero.
Example: age.

Variables that give scores can be continuous but also discrete. This is quite a tricky distinction: a
truly continuous variable can be measured to any level of precision, whereas a discrete variable can
take on only certain values (usually whole numbers) on the scale.

Measurement error
It’s one thing to measure variables, but it’s another thing to measure them accurately. Ideally we
want our measure to be calibrated such that values have the same meaning over time and across
situations. There will often be a discrepancy (inconsistency) between the number we use to represent
the thing we are measuring and the actual value of the thing we are measuring. This is known as
measurement error.

Validity and reliability
Validity refers to whether an instrument measures what it was designed to measure. There are a few
kinds of validity:
- Criterion validity: the instrument is considered valid if we are able to accurately predict
another variable, referred to as the criterion variable, using the results of the measurement.
- Concurrent validity: a measuring instrument has concurrent validity if it correctly predicts
the scores on a criterion variable that is being measured at the same time.
- Predictive validity: a measuring instrument correctly predicts the scores on a criterion
variable that lies in the future.
- Content validity: relates to whether the final measuring instrument provides a good
reflection of the theoretical concept that is being measured. It is the only type of validity that
can be determined prior to data collection. For content validity to be maximized, the concept
needs to be operationalized as accurately as possible.

Reliability refers to the ability of a measure to produce the same results under the same
conditions. The easiest way to assess reliability is to test the same group twice: a reliable
instrument will produce similar scores at both points in time (test-retest reliability).

Collecting data: research designs
We now look at how data are collected. There are two ways to test a hypothesis: either by observing
what naturally happens, or by manipulating some aspect of the environment and observing the effect
it has on the variable that interests us (an experiment).

Correlational research methods
In correlational research we observe natural events; we can do this by either taking snapshots of
many variables at a single point in time, or by measuring variables repeatedly at different time points
(longitudinal research). Correlational research provides a very natural view of the question we're
researching, because we are not influencing what happens and the measures of the variables should
not be biased by the researcher being there (ecological validity).

Experimental research
Most scientific questions imply a link between variables. Even when the cause-effect relationship is
not explicitly stated, most research questions can be broken down into a proposed cause and
proposed outcome. These are both variables. The key to answering the research question is to
uncover how the proposed cause (independent) and the proposed outcome (dependent) relate to
each other.

In correlational research variables are often measured simultaneously. There are two problems with
doing that:
1. It provides no information about the contiguity (proximity in time) between variables.
2. It doesn’t always distinguish between what we might call an accidental conjunction and a
causal one.

There can also be a third person or thing that influences the relationship, which is called the tertium
quid. Such extraneous factors (extra variables that you didn't know were there) are sometimes called
confounding variables, or confounds.

To rule out confounding variables, Mill proposed that an effect should be present when the cause is
present and that when the cause is absent, the effect should be absent also.

Two methods of data collection
When we use an experiment to collect data, there are two ways to manipulate the independent
variable:
1. Test different entities: different groups of entities take part in each experimental
condition, so every group experiences one condition (a between-groups, between-subjects or
independent design).
2. Manipulate the independent variable using the same entities: one group takes part in every
experimental condition, so each participant experiences all the conditions there are (a
within-subjects or repeated-measures design).

Two types of variation
- Systematic variation: this variation is due to the experimenter doing something in one
condition but not in the other condition
- Unsystematic variation: this variation results from random factors that exist between the
experimental conditions (natural differences in ability, time of day, etc.)

In a repeated-measures design, differences between two conditions can be caused by only two
things:

1. The manipulation that was carried out on the participants.
2. Any other factor that might affect the way in which an entity performs from time to time.

In an independent design, differences between the two conditions can also be caused by two things:
1. The manipulation that was carried out on the participants
2. Differences between the characteristics of the entities allocated to each group.

Randomization
It is important to keep the unsystematic variation to a minimum; scientists use randomization to
achieve this goal. Many statistical tests work by identifying the systematic and unsystematic sources
of variation and then comparing them. Randomization eliminates most other sources of systematic
variation, which allows us to be sure that any systematic variation between experimental conditions
is due to the manipulation of the independent variable.

The two most important sources of systematic variation in a repeated-measures design are:
- Practice effects: participants may perform differently in the second condition because of
familiarity with the experimental situation and/or measures being used.
- Boredom effects: participants may perform differently in the second condition because they
are tired or bored from having completed the first condition.

We can ensure that these effects produce no systematic variation between our conditions by
counterbalancing the order in which people participate in the conditions.

In an independent design the best way to reduce the systematic variation is to randomly assign
participants to conditions, as the sketch below illustrates.
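
A minimal sketch of that randomization step (the participant IDs are made up for illustration):

# Randomly assign twenty hypothetical participants to the two
# conditions of an independent design.
import random

participants = [f"P{i}" for i in range(1, 21)]  # hypothetical participant IDs
random.shuffle(participants)                    # the randomization step

half = len(participants) // 2
condition_a = participants[:half]  # first experimental condition
condition_b = participants[half:]  # second experimental condition
print(condition_a)
print(condition_b)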

Analysing data - central tendency
Frequency distributions
Once you have collected some data, a very useful thing to do is
to plot a graph of how many times each score occurs. This is
known as a frequency distribution or histogram. In an ideal
world our data would be distributed symmetrically around the
centre of all scores. A curve drawn through such a histogram is
known as a normal distribution. This is characterized by
a bell shape and implies that the majority of the
scores lie around the centre of the distribution (so the largest
bars on the histogram sit around the central value).

There are two main ways in which a distribution can deviate from normal:
1. Lack of symmetry (skewness) -> the distribution is not symmetrical; the most frequent scores
are clustered at one end of the scale. A distribution can be positively skewed or negatively skewed.

2. Pointiness (kurtosis) -> refers to the degree to which scores cluster at the ends of the
distribution (also known as the tails). There is positive kurtosis (leptokurtic) and negative
kurtosis (platykurtic).
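
As a quick numerical illustration (the course itself works in SPSS; SciPy is assumed to be available here), skewness and kurtosis can be computed directly from a set of scores:

# Positive skew: scores cluster at the low end with a long right tail.
# Positive (excess) kurtosis: heavier tails than a normal curve.
from scipy.stats import kurtosis, skew

scores = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 234]

print(skew(scores))      # positive here, driven by the extreme score 234
print(kurtosis(scores))  # also positive here (leptokurtic)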




The mode
We can calculate where the centre of a frequency distribution lies, this is known as the central
tendency. There are three central tendencies, the mode, the median and the mean.

The mode is the score that occurs most frequently in the data set. This is easy to see in a frequency
distribution, because it will be the tallest bar. To find the mode, you look at which number occurs
most often. One problem with the mode is that it can take on several values: you can, for example,
have a distribution with two modes (bimodal) or a data set with more than two modes (multimodal),
as in the sketch below.
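
A minimal sketch of finding the mode(s) with Python's standard library; the scores below are made up to show a bimodal case:

from collections import Counter

scores = [2, 3, 3, 5, 7, 7, 9]  # hypothetical data: 3 and 7 both occur twice

counts = Counter(scores)
highest = max(counts.values())
# Keep every score that reaches the highest frequency, so bimodal and
# multimodal data sets return more than one mode.
modes = [score for score, count in counts.items() if count == highest]
print(modes)  # [3, 7] -> bimodal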




The median
Another way to quantify the centre of a distribution is to look for the middle score when scores are
ranked in order of magnitude (low to high). This is called the median: the middle observation. You
calculate the median in slightly different ways for an odd and an even number of scores.

Odd:
22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 234

1. You put the numbers in order from low to high; in this case they already are.
2. You use the formula: (n+1)/2
3. (11+1)/2 = 12/2 = 6 (11 is the number of scores in the row above)
4. The sixth observation is 98
5. The median is 98

Even:
22, 40, 53, 57, 93, 98, 103, 108, 116, 121

1. You put the numbers in order from low to high; in this case they already are.
2. You use the formula: (n+1)/2

3. (10+1)/2 = 11/2 = 5.5 (10 is the number of scores in the row above)
4. If the result is not a whole number, the median is the average of the two middle
values (here the fifth and sixth scores).
5. The median is (93+98)/2 = 95.5

As you can see, when we take away the extreme score the median only changes from 98 to 95.5. The
median is relatively unaffected by extreme scores, and it is also relatively unaffected by skewed
distributions. The sketch below implements both cases.
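
A sketch of the (n+1)/2 rule from the steps above, checked against both examples:

def median(scores):
    ordered = sorted(scores)            # step 1: rank from low to high
    position = (len(ordered) + 1) / 2   # step 2: the (n+1)/2 formula
    if position.is_integer():           # odd n: points at a single score
        return ordered[int(position) - 1]
    # even n: average the two middle scores
    return (ordered[int(position) - 1] + ordered[int(position)]) / 2

print(median([22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 234]))  # 98
print(median([22, 40, 53, 57, 93, 98, 103, 108, 116, 121]))       # 95.5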

The mean
The mean is the measure of central tendency that people have heard of most, because it is equal to
the average: the mean is the same as the average score. There is a formula for it:

x̄ = Σx / N

This basically means you sum up all the scores and divide the total by the number of scores (N).

22 + 40 + 53 + 57 + 93 + 98 + 103 + 108 + 116 + 121 + 234 = 1045
1045/11 = 95

If you take away the 234, the mean drops to 81.1. This means that the mean is influenced by
extreme scores; it is also affected by skewed distributions.
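
The same calculation in a couple of lines, including the effect of removing the extreme score:

scores = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 234]

print(sum(scores) / len(scores))            # 95.0
print(sum(scores[:-1]) / len(scores[:-1]))  # 81.1 once 234 is dropped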

Analysing data - Dispersion
It is also interesting to quantify the spread or dispersion of scores.

Range
The easiest measure of dispersion is the range. You calculate it by subtracting the lowest score from
the highest score. If we keep following the previous example: the lowest score is 22 and the highest
score is 234, so 234 - 22 = 212.
The range is dramatically affected by extreme scores.

Interquartile range
To avoid the extreme scores you can use the interquartile range, which means that you only look
at the middle 50% of the scores. You discard the lowest 25% and the highest 25%; so you cut off the
top and bottom 25% and calculate the range of the middle 50% of scores. Quartiles are the three
values that split the data into four equal parts of 25% each.

1. First we calculate the median -> the second quartile
2. We calculate the lower quartile -> the value that cuts off the bottom 25%
3. We calculate the upper quartile -> the value that cuts off the top 25% (above 75%)
4. Rule of thumb: the median is not included in the two halves when they are split.
5. Calculate the interquartile range by subtracting the lower quartile from the upper quartile.

22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 234
1. The median is 98
2. 22, 40, 53, 57, 93 → (5+1)/2 = 3, the third score is 53
3. 103, 108, 116, 121, 234 → (5+1)/2 = 3, the third score is 116
4. The interquartile range is 116-53 = 63

The advantage is that the interquartile range isn't affected by extreme scores; however, you lose
half of the data. The sketch below computes both measures.
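
A sketch of both measures, following the rule of thumb that the overall median is excluded from the two halves:

def median(ordered):
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

scores = sorted([22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 234])

print(scores[-1] - scores[0])            # range: 234 - 22 = 212

half = len(scores) // 2                  # 5; the median itself is excluded
lower_quartile = median(scores[:half])   # 53, median of the bottom half
upper_quartile = median(scores[-half:])  # 116, median of the top half
print(upper_quartile - lower_quartile)   # interquartile range: 63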

Standard deviation
The standard deviation indicates the spread of the scores around the average (mean).
If you want to calculate the standard deviation, also known as SD, you first have to calculate some
other things. First you need to know the mean of the data.

Then you have to calculate the deviance. This means the difference between each score and the
mean. As a formula this would look like this:

deviance = xᵢ - x̄
Then you have to calculate the total deviance. You do this by just adding all the deviances from the
previous step:

total deviance = Σ(xᵢ - x̄)
Then you have to calculate the sum of squared errors. The only thing you do, although it looks very
complicated, is square each deviance score. So if you have a deviance of 4, you square it:
4 × 4 = 16.

SS = Σ(xᵢ - x̄)²
We are going to use the previous sample and calculate everything, so we eventually can calculate the
standard deviation.
Number   Mean   Deviance (number - mean)   Deviance squared
22       95     -73                        5329
40       95     -55                        3025
53       95     -42                        1764
57       95     -38                        1444
93       95     -2                         4
98       95     3                          9
103      95     8                          64
108      95     13                         169
116      95     21                         441
121      95     26                         676
234      95     139                        19321
                                    Sum:   32246


1. We put all the numbers in a column, ordered from low to high.
2. We calculate the mean, which is 95: 22 + 40 + 53 + 57 + 93 + 98 + 103 + 108 + 116 + 121 + 234
= 1045, and 1045/11 = 95.
3. Now we calculate the deviances; the first one is 22 - 95 = -73.
4. We square all the deviances, one by one. Remember that when a negative
number is squared it becomes positive.
5. We calculate the sum of squared errors by adding up all the squared deviances.

After doing this you have to calculate the variance, which is the average dispersion. The variance is
simply the sum of squares divided by N - 1:

variance = SS / (N - 1)

6. Calculate the variance: the sum of squared errors (32246) divided by N - 1 = 11 - 1 = 10,
giving 32246/10 = 3224.6.

The final step is to calculate the standard deviation, for which you now only have to take the square
root of the variance.
7. √3224.6 = 56.79
8. The standard deviation is 56.79.

A small standard deviation (relative to the mean) indicates that the data points are close to the
mean. A large standard deviation (relative to the mean) indicates that the data points are distant
from the mean. The sketch below chains all these steps together.
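
The whole chain from the walkthrough, condensed into a short script:

import math

scores = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 234]

mean = sum(scores) / len(scores)        # 95.0
deviances = [x - mean for x in scores]  # e.g. 22 - 95 = -73
ss = sum(d ** 2 for d in deviances)     # sum of squared errors: 32246.0
variance = ss / (len(scores) - 1)       # 32246 / 10 = 3224.6
sd = math.sqrt(variance)                # about 56.79
print(ss, variance, round(sd, 2))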

Analysing data - Probability
Another way to think about a frequency distribution is not in terms of how often scores actually
occurred, but in terms of how likely it is that a score would occur: the probability.

We are now going to use a new example: the ice bucket challenge. The histogram below showed how
many videos regarding the ice bucket challenge were put on YouTube on each day since the first
video.

[Histogram of daily ice bucket challenge video uploads; the bars for days 35-40 highlighted in orange]
You can ask: what is the likelihood of a video being posted 35-40 days into the challenge? We can
find that out by adding up the orange bars in the graph: 196 + 204 + 196 + 174 + 164 + 141 = 1075.
The total number of videos is 2323, and 1075/2323 = 0.46, which expressed as a percentage is 46%.
So, to answer the question: it is quite likely that a video was posted 35-40 days in, because 46% of
the videos were placed in those 6 days.
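
The same probability calculation as code, using the six frequencies quoted above:

# Probability of a score falling in a range = frequency in that range
# divided by the total frequency.
days_35_to_40 = [196, 204, 196, 174, 164, 141]
total_videos = 2323

p = sum(days_35_to_40) / total_videos
print(round(p, 2))  # 0.46, i.e. 46% of all videos fell in that window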

A probability value can range from 0 (there is no chance) to 1 (it definitely will happen). For any
distribution of scores we could, in theory, calculate the probability of obtaining a score of a certain
size; it would be very complex to do, but we could. Statisticians have identified several
common distributions, and for each one they have worked out a mathematical formula (a probability
density function, PDF). If we plot the value of the variable (x) against the probability of it
occurring (y), the resulting curve is known as the probability distribution.
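
For instance (a sketch assuming SciPy is available; the mean and SD are borrowed from the earlier standard deviation example purely for illustration), evaluating the normal PDF at a few x values traces out that bell-shaped curve:

from scipy.stats import norm

mean, sd = 95, 56.79  # values from the earlier example, used for illustration
for x in (0, 50, 95, 140, 200):
    print(x, round(norm.pdf(x, loc=mean, scale=sd), 5))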