Summary Applied Data Analysis
Lecture 1 Exploration and crosstabulation
Chapter 2 (§ 2.1 – 2.10), chapter 3 (§ 3.1 – 3.7), chapter 5 (§ 5.1 – 5.9), chapter 6 (p. 243-252, 268-276),
chapter 19 (§ 19.1 – 19.3.6, 19.7)
Exploration 1: Pictures
Why explore?
Generally, research is (and should be) hypothetico-deductive.
o Formulate a hypothesis (on theoretical grounds) and deduce which pattern of results
should follow from it.
o Collect data to test whether these hypotheses apply (in the population).
Usually, this leads to a focused prediction (e.g., females have higher social skills scores than
males: µf > µm). That is, you expect the population mean of women to be higher than the
population mean of men.
However, do not limit yourself to that prediction!
o Sometimes, unexpected results are the most interesting ones (isn’t science about
finding out new things?);
o Almost always, we need to check assumptions of hypothesis tests.
Main steps in data analysis
1. Explore. Look at what’s in your data.
2. Check assumptions. Significance tests make assumptions about the data, but do they apply in
your case? (and if violated, what has to be done?)
3. Hypothesis testing. Determine if a predicted relationship exists in the sample (e.g. a
correlation between two variables) and if it can be generalized from sample to population.
4. Interpretation. Analyze the nature of the relationships between variables.
5. Write. Report your results (following APA rules).
Preliminary step. Decide which technique is most suitable for your research question.
Exploring frequency distributions
Two basic ways of exploration
Make pictures (histograms, boxplots)
Compute statistics (mean, median, mode, variance, standard deviation, skewness, kurtosis,
Kolmogorov-Smirnov test).
We will do both, with emphasis on normality.
Remark. Very often the normality assumption is not as important as suggested by Field, because
many tests are robust against violation of this assumption.
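As a rough illustration of the "compute statistics" route, the sketch below calculates several of the statistics listed above with Python's standard library instead of SPSS. The scores are made up for illustration (they are not the lecture data), and the function name describe is hypothetical.

```python
import statistics

# Hypothetical distress scores, made up for illustration (not the lecture data).
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "sd": statistics.stdev(values),           # sample standard deviation
    }

summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this made-up example the single high score (14) pulls the mean (5) above the median (4), which is a first hint of positive skewness.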
SPSS procedure Explore
Histograms
Data. 468 working people from four occupational groups, on three indicators of distress.
Histogram. Picture of a frequency distribution (categories on X-axis, numbers of individuals on Y-
axis).
Normality at first sight. From left to right, the histograms deviate more from normality.
The second picture shows positive skewness; the third shows even more positive skewness.
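To make the idea of a histogram concrete, here is a minimal sketch that builds a frequency table and prints it as a text histogram. It assumes non-negative integer scores; the data, the bin width of 3 and the function name frequency_table are all hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical scores, made up for illustration (not the lecture data).
scores = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 12]
width = 3  # assumed bin width

def frequency_table(values, bin_width):
    """Count how many values fall in each bin, keyed by the bin's lower bound."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

table = frequency_table(scores, width)
for lower, count in table.items():
    # One '#' per individual: the bar length plays the role of the Y-axis.
    print(f"{lower:>2}-{lower + width - 1:<2} {'#' * count}")
```

The bars get shorter toward the higher bins, so this made-up distribution has its long tail on the right, like the positively skewed histograms described above.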
Boxplot (1). What it does
Concise and informative way of presenting a frequency distribution.
The entire cream-colored box represents the middle 50 percent of the scores. Everything in a boxplot
is expressed in terms of percentiles.
Boxplot (2). A normal beauty
Warning. Boxplots are based on percentiles (median is 50th percentile). They
do not necessarily give the same results as measures based on means and
variances.
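The warning can be illustrated with a small sketch that computes the percentile-based values behind a boxplot and compares the median with the mean. The scores are made up, the function name five_number_summary is hypothetical, and the quartiles use the default method of Python's statistics.quantiles; quartile conventions differ between programs, so SPSS may report slightly different values.

```python
import statistics

# Hypothetical scores, made up for illustration (not the lecture data).
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]

def five_number_summary(values):
    """Return the percentile-based values a boxplot is built from."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # 25th, 50th, 75th percentile
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

box = five_number_summary(scores)
mean = statistics.mean(scores)
print(box)
print("box height (interquartile range):", box["q3"] - box["q1"])
print("mean:", mean, "median:", box["median"])
```

With one very high score in the data, the mean lies above the median, so a summary based on means tells a slightly different story than the boxplot.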
Boxplot example
Some signs of asymmetry (e.g. the 3rd quartile of the box is smaller than the 2nd, so more
scores lie closely above than closely beneath the median; we see this as a
peak in the histogram).
There are some outliers, all positive (i.e. there are extraordinarily
dissatisfied, but not extraordinarily satisfied persons).
Boxplot (3): an ugly one
No perfect normality or symmetry in previous boxplot,
but it can be much worse. Look at anxiety boxplot.
Very positively skewed distribution.
Most people are low on anxiety: more than 25% have
the lowest possible score (25th percentile =
lowest score, so there is no “stick” under the box).
A lot of outliers and extreme scores.
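The sketch below shows how such a floor effect and the distinction between outliers and extreme scores could be checked numerically. It assumes the common convention that outliers lie more than 1.5 box lengths above the box and extreme scores more than 3 box lengths above it; the notes do not state the rule, so treat these cut-offs as an assumption. The data and the function name classify_high_scores are hypothetical.

```python
import statistics

# Hypothetical anxiety scores, made up for illustration (not the lecture data).
scores = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 10, 20]

def classify_high_scores(values):
    """Split unusually high scores into outliers and extremes.

    Assumed rule: outlier = more than 1.5 box lengths above the box,
    extreme = more than 3 box lengths above the box.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    box_length = q3 - q1
    outliers = [v for v in values
                if q3 + 1.5 * box_length < v <= q3 + 3 * box_length]
    extremes = [v for v in values if v > q3 + 3 * box_length]
    return q1, outliers, extremes

q1, outliers, extremes = classify_high_scores(scores)
print("25th percentile equals lowest score:", q1 == min(scores))
print("outliers:", outliers)
print("extremes:", extremes)
```

Because the 25th percentile coincides with the lowest score in this made-up sample, there is nothing below the box to draw, which mirrors the missing "stick" described above.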
Boxplot (4). Comparing groups
Use boxplots to compare different variables, or to compare different groups on the same variable (here:
occupation).
Conclusions
• Box heights are not too different.
• Medians and boxes of the first two groups are higher than those of the last two groups, so the first
two groups appear to be more dissatisfied.
• Only positive outliers (i.e. persons who are very dissatisfied in comparison to the rest of their
group).
Boxplots (5). Comparing variables
Boxplots for different variables are only useful when the variables have comparable measuring scales.
The combined boxplot is not very useful here.
Scales are not comparable (so comparing medians or box heights is not interesting).
Plots with relatively small scales (anxiety, dissatisfaction) are almost unreadable.
Exploration 2: Statistics
Skewness and kurtosis (1)
Skewness: measure of asymmetry of the distribution.
• perfect symmetry: skewness = 0 (example: normal distribution);
• long tail of distribution to the right: skewness > 0 (many low scores and few high scores);
• long tail of distribution to the left: skewness < 0 (few low scores and many high scores).
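The three cases above can be checked with a minimal sketch of the simple moment-based skewness coefficient (third central moment divided by the second central moment to the power 1.5). This is an assumption about which formula to show: statistical packages often report a bias-adjusted version, so their values can differ somewhat, although the sign has the same interpretation. All data are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical scores, made up for illustration.
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 3, 10]            # many low scores, few high scores
left_tail = [11 - v for v in right_tail]    # mirror image: few low, many high

print("symmetric: ", skewness(symmetric))
print("right tail:", skewness(right_tail))
print("left tail: ", skewness(left_tail))
```

The mirrored data set gives the same skewness with the opposite sign, which shows that the sign only reflects the direction of the long tail.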