Summary of Applied Data Analysis - Lecture notes and Book (Field)

This is a summary for the course Applied Data Analysis. It consists of extensive lecture notes and a summary of the book by Field. My grade for the exam was a 9.2.

June 21, 2022 • 45 pages • 2021/2022
ADA Summary lectures + book

Lecture 1
1. Exploration

Main steps in data analysis:
1) Explore. Look at what is in your data.
2) Check assumptions. Significance tests assume things about the data,
but do these assumptions hold in your case?
3) Hypothesis testing. Determine if a predicted relationship exists in
the data, and if it can be generalized from sample to population.
4) Interpretation. Analyze the nature of the relationships between
variables.
5) Write. Report your results.

There are 2 basic ways of exploration. You can:
1. Make pictures
2. Compute statistics

1. Making pictures
Histograms (bar charts / frequency distributions)
You can use these to check for normality. Histograms are useful for
looking at the shape of your data and spotting problems. A histogram plots
a single variable (x-axis) against the frequency of scores (y-axis).
There are:
1) Simple histograms: good to visualize frequencies of scores for a single
variable
2) Stacked histograms: good for having a grouping variable in which each
bar is split by group (such as male/female)
3) Population pyramids: this is like a stacked histogram, it shows the
relative frequency of scores in two populations. Useful for comparing
distributions across groups.
Positively skewed – if the long tail of the distribution goes right (skewness
> 0)
Negatively skewed – if the long tail of the distribution goes left (skewness
< 0)
Positive values of skewness indicate too many low scores in the
distribution, whereas negative values indicate a build-up of high scores.
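
As a rough illustration of what a simple histogram counts, here is a minimal Python sketch (the notes themselves use SPSS). The scores are made up for the example and are not course data. Each row shows one score value and how often it occurs, so the long tail towards the high scores (positive skew) is visible.

```python
from collections import Counter

# Hypothetical scores (invented for this example): many low scores and a
# long tail towards the high values, i.e. a positively skewed sample.
scores = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 7, 9]

# Frequency of each score: this is what the y-axis of a histogram shows.
frequencies = Counter(scores)

# Print a sideways text histogram, one row per possible score value.
for value in range(min(scores), max(scores) + 1):
    count = frequencies.get(value, 0)
    print(f"{value:>2} | {'#' * count} ({count})")
```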




Boxplots
A concise and informative way of presenting a frequency distribution,
based on percentiles and the median. The box runs from the 25th percentile
to the 75th percentile, with the median (50th percentile) marked inside
it, so it contains the middle 50% of scores. A larger box thus means that
the middle 50% of scores are more spread out. The lower stick (whisker)
covers the scores between the lowest non-outlying score and the 25th
percentile. The upper stick covers the scores between the 75th percentile
and the highest non-outlying score. Outliers are shown as dots in the
picture. Scores are outliers when they are 1.5-3 box heights from the box.
An extreme score is more than 3 box heights from the box.
In the boxplot, you can also see skewness by looking at the sticks and
outliers. Positively skewed: longer stick and more outliers at the top
(towards the high scores). Negatively skewed: longer stick and more
outliers at the bottom (towards the low scores). If the distribution is
symmetrical, the sticks are also symmetrical (of the same length).
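
The box, sticks and outlier rules can be sketched in Python as follows. This is only an illustration: the function name and the scores are made up, and the quartiles come from Python's "inclusive" percentile method, which can differ slightly from the values SPSS uses to draw the box.

```python
import statistics

def boxplot_summary(scores):
    """Return the box, whisker ends, outliers and extreme scores."""
    # 25th percentile, median and 75th percentile (assumed percentile method).
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    box_height = q3 - q1  # the middle 50% of scores lies inside the box

    inside, outliers, extremes = [], [], []
    for score in scores:
        # Distance from the nearest box edge (0 for scores inside the box).
        distance = max(q1 - score, score - q3, 0)
        if distance > 3 * box_height:
            extremes.append(score)   # more than 3 box heights from the box
        elif distance > 1.5 * box_height:
            outliers.append(score)   # 1.5 to 3 box heights from the box
        else:
            inside.append(score)

    return {
        "q1": q1, "median": median, "q3": q3,
        "lower_stick_end": min(inside),  # lowest non-outlying score
        "upper_stick_end": max(inside),  # highest non-outlying score
        "outliers": outliers, "extremes": extremes,
    }

# Hypothetical, positively skewed scores (invented for this example).
summary = boxplot_summary([2, 3, 3, 4, 4, 5, 5, 6, 7, 11, 16, 30])
print(summary)
```

In this made-up sample the upper stick is longer than the lower one and both unusual scores sit above the box, which is the pattern for positive skew.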

Boxplots can be ugly. For example, the lower stick can be absent. This
occurs when at least 25% of the scores have the lowest possible score.

Note: you can only put several variables in one boxplot graph when they
are measured on comparable scales. Furthermore, plots with small scales
are almost unreadable.

There are three types of boxplots:
1) 1-D Boxplot: a single boxplot of all scores for the chosen outcome
2) Simple boxplot: produces multiple boxplots for the chosen outcome by
splitting the data by a categorical variable. Useful for displaying the
boxplots of different groups in the same graph
3) Clustered boxplot: same as the simple boxplot, except that it also
splits the data by a second categorical variable.

Scatterplot
A graph that plots each person’s score on one variable against their score
on another. It visualises the relationship between the variables (you use it
to see if there is a correlation).
1) Simple scatter: to plot values of one continuous variable against
another. Just for looking at two variables. Although we can still talk about
predictors and outcomes, these terms do not imply causal relationships.
Here you can also ask SPSS for a regression line that summarizes the
relationship between variables in scatterplots.
2) Grouped scatter: like a simple scatter, except you can display points
belonging to different groups (a third categorical variable) in different
colours. You can also add a regression line for each group.
3) Simple dot plot (density plot): like a histogram, except that, rather than
having a summary bar representing the frequency of scores, individual
scores are displayed as dots. Like histograms, they are useful for looking
at the shape of the distribution of scores.
4) Scatterplot matrix: produces a grid of scatterplots in which each cell
shows the relationship between one pair of variables. Very convenient for
examining the relationships between all pairs of many variables at once.

2. Compute statistics

* Skewness: measure of asymmetry of the distribution. Can be negatively
or positively skewed (see the description under histograms above).

* Kurtosis: measure of peakedness of a distribution.
Perfectly normal = 0
Peak higher than normal > 0
Peak lower than normal < 0
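
To make the two statistics concrete, here is a minimal Python sketch that computes them from the moments of the data. The function names and the example scores are made up. SPSS reports bias-corrected versions of both statistics, so its values will differ somewhat in small samples.

```python
def central_moment(scores, order):
    """Average of (score - mean) raised to the given power."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** order for x in scores) / len(scores)

def skewness(scores):
    # 0 for a symmetrical distribution, > 0 for a long right tail,
    # < 0 for a long left tail.
    return central_moment(scores, 3) / central_moment(scores, 2) ** 1.5

def excess_kurtosis(scores):
    # Subtracting 3 makes a perfectly normal distribution score 0.
    return central_moment(scores, 4) / central_moment(scores, 2) ** 2 - 3

# Hypothetical samples (invented for this example).
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 3, 9]
left_tail = [-x for x in right_tail]

print(skewness(symmetric), skewness(right_tail), skewness(left_tail))
print(excess_kurtosis(symmetric))
```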

Difference between standard error and standard deviation:
Standard error: measure of the accuracy of a statistic (how much it varies
from sample to sample)
Standard deviation: measure of the spread of scores in the sample

Hypothesis testing with z-value
To test the null hypothesis that skewness = 0 or kurtosis = 0, you use the
z-value. To obtain it, divide the skewness or kurtosis by its standard
error (you can find both in the SPSS descriptives table):
z = skewness or kurtosis / standard error
The further the value is from 0, the more likely it is that the data are
not normally distributed. Positive values of skewness indicate too many
low scores in the distribution, negative values indicate a build-up of high
scores. For kurtosis, positive values indicate a heavy-tailed distribution,
whereas negative values indicate a light-tailed distribution.
The critical z-values are -1.96 and 1.96. Any z-value below -1.96 or above
1.96 is significant at the p<.05 level. The cut-offs are 2.58 for the
p<.01 level and 3.29 for the p<.001 level.
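
The calculation can be sketched in Python as follows. The function names are made up, and the skewness, kurtosis and standard error values are hypothetical numbers standing in for what you would read from the SPSS descriptives table.

```python
def z_value(statistic, standard_error):
    """Skewness or kurtosis divided by its standard error."""
    return statistic / standard_error

def significance_level(z):
    """Compare the absolute z-value with the usual two-sided cut-offs."""
    size = abs(z)
    if size > 3.29:
        return "p < .001"
    if size > 2.58:
        return "p < .01"
    if size > 1.96:
        return "p < .05"
    return "not significant"

# Hypothetical values (invented for this example, not course output).
z_skewness = z_value(0.85, 0.30)
z_kurtosis = z_value(-0.40, 0.59)

print(round(z_skewness, 2), significance_level(z_skewness))
print(round(z_kurtosis, 2), significance_level(z_kurtosis))
```

In this made-up case the skewness is significantly different from 0 and the kurtosis is not.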

Hypothesis testing with Kolmogorov-Smirnov test
This is a more direct normality test. It tests whether a distribution is
significantly different from normality by comparing the scores in the
sample to a normally distributed set of scores with the same mean and
standard deviation. If a KS test is highly significant (for example
p <.01), we have to conclude that the distribution is not normal. In SPSS
you use this test with the Lilliefors correction. So, you do not want
significance in the KS test: a non-significant result tells us that the
distribution of the sample is not significantly different from a normal
distribution.
SPSS: Explore – plots – tick ‘normality plots with tests’.
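
The comparison behind the KS test can be sketched in Python. The sketch below only computes the test statistic D, the largest gap between the cumulative proportion of scores in the sample and the cumulative probability under a normal distribution with the same mean and standard deviation. It does not compute the Lilliefors-corrected p-value that SPSS reports. The function name and both samples are made up.

```python
import statistics

def ks_distance_from_normal(scores):
    """Largest gap between the sample and a matching normal distribution."""
    data = sorted(scores)
    n = len(data)
    # Normal distribution with the same mean and standard deviation.
    normal = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))
    d = 0.0
    for rank, score in enumerate(data, start=1):
        expected = normal.cdf(score)
        # Check the gap just after and just before the step at this score.
        d = max(d, rank / n - expected, expected - (rank - 1) / n)
    return d

# Hypothetical samples (invented): one built from normal quantiles,
# one strongly positively skewed.
standard_normal = statistics.NormalDist()
near_normal = [standard_normal.inv_cdf((rank - 0.5) / 20) for rank in range(1, 21)]
skewed = [1] * 10 + [2, 2, 3, 3, 4, 5, 7, 10, 15, 30]

d_normal = ks_distance_from_normal(near_normal)
d_skewed = ks_distance_from_normal(skewed)
print(round(d_normal, 3), round(d_skewed, 3))
```

The larger D is, the further the sample is from normality. Whether it is significant depends on the sample size and the correction used.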

You can also use graphs to spot normality. For example the P-P plot plots
the cumulative probability of a variable against the cumulative probability
of a particular distribution. The actual z-score is plotted against the
expected z-score. If the data are normally distributed, then the actual z-
score will be the same as the expected z-score, and you’ll get a straight
diagonal line. So, if values fall on the diagonal of the plot then the variable
is normally distributed. When the data is consistently above or below the
diagonal then this shows that the kurtosis differs from a normal
distribution, and when the data points are S-shaped, the problem is
skewness. The Q-Q plot is the same as P-P, except it plots the quantiles of
the data instead of every individual score. It can be interpreted exactly the
same as a P-P plot.
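
The points of a P-P plot can be sketched in Python without drawing anything. The function names and samples are made up, and the observed cumulative probability is taken as (rank - 0.5) / n, which is one common convention and not necessarily the one SPSS uses.

```python
import statistics

def pp_points(scores):
    """Pairs of (observed, expected) cumulative probabilities."""
    data = sorted(scores)
    n = len(data)
    normal = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))
    points = []
    for rank, score in enumerate(data, start=1):
        observed = (rank - 0.5) / n   # assumed plotting position
        expected = normal.cdf(score)  # probability under normality
        points.append((observed, expected))
    return points

def largest_gap(points):
    """How far the points stray from the diagonal (observed = expected)."""
    return max(abs(observed - expected) for observed, expected in points)

# Hypothetical samples (invented for this example).
standard_normal = statistics.NormalDist()
near_normal = [standard_normal.inv_cdf((rank - 0.5) / 20) for rank in range(1, 21)]
skewed = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 10, 15, 30]

gap_normal = largest_gap(pp_points(near_normal))
gap_skewed = largest_gap(pp_points(skewed))
print(round(gap_normal, 3), round(gap_skewed, 3))
```

For the near-normal sample the points stay close to the diagonal. For the skewed sample they move clearly away from it.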

As sample size gets larger, the assumption of normality matters less,
because the sampling distribution then tends towards normality (we call
this the central limit theorem). As a rule of thumb, this is the case when
group sizes are larger than 15. At the same time, in large samples a test
of normality is more likely to be significant, even for small deviations
from normality. So, in larger samples you should not do significance tests
and instead look at the shape of the distribution visually, interpret the
value of the skewness and kurtosis statistics, and possibly don’t even
worry about normality at all.

Hypothesis testing with Levene’s test
This test tells us whether the assumption of homogeneity of variances is
violated. If Levene’s test is significant, the assumption is violated and
equal variances are thus not assumed. If the group sizes are roughly equal
(Nmax / Nmin < 1.5), F tests are robust against this violation. For
t-tests, you always use Welch’s t-test, which corresponds to the ‘equal
variances not assumed’ option (further explanation in t-test part).
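
The idea behind Levene's test can be sketched in Python. The sketch assumes the mean-centred version of the test: it takes the absolute distance of every score from its group mean and compares the average distance between groups. It only computes the test statistic, which is then compared with an F distribution with (k - 1, N - k) degrees of freedom. No p-value is computed here. The function names and groups are made up.

```python
import statistics

def levene_statistic(groups):
    """Levene's W based on absolute deviations from each group mean."""
    deviations = [[abs(x - statistics.mean(g)) for x in g] for g in groups]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(d) for d in deviations) / n_total
    # Variation of the average deviation between groups ...
    between = sum(len(d) * (statistics.mean(d) - grand_mean) ** 2 for d in deviations)
    # ... relative to the variation of deviations within groups.
    within = sum((z - statistics.mean(d)) ** 2 for d in deviations for z in d)
    return (n_total - k) / (k - 1) * between / within

def group_size_ratio(groups):
    """Nmax / Nmin, used for the rule of thumb on robustness."""
    sizes = [len(g) for g in groups]
    return max(sizes) / min(sizes)

# Hypothetical groups (invented): a and b have the same spread,
# c is three times as spread out.
group_a = [4, 5, 6, 7, 8]
group_b = [14, 15, 16, 17, 18]
group_c = [0, 3, 6, 9, 12]

w_equal = levene_statistic([group_a, group_b])
w_unequal = levene_statistic([group_a, group_c])
print(w_equal, round(w_unequal, 3), group_size_ratio([group_a, group_c]))
```

Equal spreads give a statistic of 0, however far apart the group means are. The statistic grows as the spreads start to differ.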

Lecture 1
2. Crosstabulation

Chi-square test
Investigates and tests the relationship between two categorical (or
nominal) variables. You cannot use the mean or any similar statistic for
categorical variables. Instead, this test is based on the simple idea of
comparing the frequencies you observe in certain categories to the
frequencies you might expect to get in those categories by chance.
For example: to test the relationship between gender and kind of study
(alpha, beta or gamma).

Can be used for testing the following null hypothesis:
H0: the two nominal variables are completely independent of each other.
Chi squared is based on a comparison between observed cell frequencies
and expected cell frequencies if H0 is true. For example: if the observed
frequencies (O) are 10 and 30, then the expected frequencies (E) would be
20 and 20.
It is based on the discrepancy that occurs for each cell between what one
would expect under independence and what was actually observed.
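
The comparison of observed and expected frequencies can be sketched in Python. The crosstable below is hypothetical (made-up counts for gender by kind of study) and the function names are invented for the example. Under independence, the expected frequency of a cell is its row total times its column total divided by the total number of cases.

```python
def expected_counts(observed):
    """Expected cell frequencies if the two variables are independent."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]

def chi_square(observed):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(observed, expected)
        for o, e in zip(observed_row, expected_row)
    )

# Hypothetical crosstable: rows = gender, columns = alpha, beta, gamma.
observed = [
    [10, 20, 30],
    [30, 20, 10],
]

print(expected_counts(observed))
print(chi_square(observed))
```

Every expected frequency is 20 here, so the cells with 10 and 30 observed cases each add (10)^2 / 20 = 5 to the statistic, and the cells with 20 add nothing.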

What are degrees of freedom? The number of free observations. To know
whether your chi-squared value is significant, you have to look it up in a
table. Where you have to look in that table depends on the degrees of
freedom (you don’t actually have to look in this table, since we don’t do
it by hand; however, maybe you remember doing this in your bachelor).
“There exists a whole family of different chi squared distributions,
dependent on degrees of freedom”.

In the chi-squared test, df = (I - 1) * (J - 1), in which I and J are the
numbers of categories of the two variables. Equivalently, df = (r - 1) *
(c - 1), in which r is the number of rows and c is the number of columns
of the crosstable.
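
The table lookup can be sketched in Python. The critical values below are the standard chi-squared cut-offs at p = .05 for 1 to 4 degrees of freedom. The function names and the chi-squared value of 20 are hypothetical and only there to show the comparison.

```python
# Critical chi-squared values at the p = .05 level, per degrees of freedom.
CRITICAL_VALUES_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def degrees_of_freedom(rows, columns):
    """df = (r - 1) * (c - 1) for a crosstable with r rows and c columns."""
    return (rows - 1) * (columns - 1)

def is_significant(chi_squared, rows, columns):
    """True if the statistic exceeds the p = .05 critical value."""
    df = degrees_of_freedom(rows, columns)
    return chi_squared > CRITICAL_VALUES_05[df]

# Hypothetical example: a 2 x 3 crosstable with a chi-squared value of 20.
df_example = degrees_of_freedom(2, 3)
print(df_example, CRITICAL_VALUES_05[df_example], is_significant(20.0, 2, 3))
```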
