Summary

Summary of Applied Data Analysis - Lecture notes and Book (Field)

Institution
Universiteit Leiden (UL)

This is a summary for the course Applied Data Analysis. It consists of extensive lecture notes and a summary of the book by Field. My grade for the exam was a 9.2.

Whole book summarized? No
Which part of the book is summarized? Unknown
Uploaded on 21 June 2022
Number of pages 45
Written in 2021/2022
Type Summary

Seller: saravandeven

ADA Summary lectures + book

Lecture 1
1. Exploration

Main steps in data analysis:
1) Explore. Look at what is in your data.
2) Check assumptions. Significance tests assume things about the data,
but do those assumptions hold in your case?
3) Hypothesis testing. Determine if a predicted relationship exists in
the data, and if it can be generalized from sample to population.
4) Interpretation. Analyze the nature of the relationships between
variables.
5) Write. Report your results.

There are 2 basic ways of exploration. You can:
1. Make pictures
2. Compute statistics

1. Making pictures
Histograms (bar charts / frequency distributions)
You can use these to check for normality. Histograms are useful to look
at the shape of your data and spot problems. A histogram plots a single
variable (x-axis) against the frequency of scores (y-axis). There are:
1) Simple histograms: good to visualize frequencies of scores for a single
variable
2) Stacked histograms: good for having a grouping variable in which each
bar is split by group (such as male/female)
3) Population pyramids: this is like a stacked histogram, it shows the
relative frequency of scores in two populations. Useful for comparing
distributions across groups.
Positively skewed – if the long tail of the distribution goes right (skewness
> 0)
Negatively skewed – if the long tail of the distribution goes left (skewness
< 0)
Positive values of skewness indicate too many low scores in the
distribution, whereas negative values indicate a build-up of high scores.

Boxplots
Concise and informative way of presenting a frequency distribution. Uses
percentiles and medians. In a boxplot, you have the box, which consists of
the 25th percentile, median (50th percentile), and the 75th percentile. So it
has the middle 50% of scores. A larger box thus means that the middle
50% of scores are more spread out. Under the 25th percentile, you have
scores that are in between the lowest non-outlying score and the 25th
percentile. Above the 75th percentile, you have scores that are in between
the 75th percentile and the highest non-outlying score. Outliers are shown
as dots in the picture. Scores are outliers when they are 1.5-3 box heights
from the box. An extreme score is more than 3 box heights from the box.
In the boxplot, you can also see skewness by looking at the sticks
(whiskers) and the outliers. With high scores at the top of the plot:
Positively skewed: longer stick and more outliers at the top (the long
tail is at the high scores). Negatively skewed: longer stick and more
outliers at the bottom (the long tail is at the low scores). If the
distribution is symmetrical, the sticks are also symmetrical (of the
same length).
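
The boxplot rules above can be sketched in a few lines of Python. This is
only an illustration, not how SPSS computes its plot: the function name
boxplot_summary, the example scores, and the choice of the "inclusive"
quantile method are assumptions (SPSS may place the 25th and 75th
percentile slightly differently).

```python
import statistics


def boxplot_summary(scores):
    """Summarize one variable the way the boxplot description above does."""
    # 25th percentile, median and 75th percentile (quantile method is an assumption).
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    box_height = q3 - q1  # the box holds the middle 50% of scores

    # Fences at 1.5 and 3 box heights from the edges of the box.
    inner_low, inner_high = q1 - 1.5 * box_height, q3 + 1.5 * box_height
    outer_low, outer_high = q1 - 3 * box_height, q3 + 3 * box_height

    non_outlying = [x for x in scores if inner_low <= x <= inner_high]
    outliers = [x for x in scores
                if outer_low <= x < inner_low or inner_high < x <= outer_high]
    extremes = [x for x in scores if x < outer_low or x > outer_high]

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # The sticks end at the lowest and highest non-outlying score.
        "whiskers": (min(non_outlying), max(non_outlying)),
        "outliers": outliers,   # 1.5 to 3 box heights from the box
        "extremes": extremes,   # more than 3 box heights from the box
    }


# Hypothetical scores with a long tail at the high end (positive skew).
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40])
print(summary)
```

In this made-up example the flagged scores sit on the high side only,
which is the pattern described for a positively skewed distribution.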

Boxplots can be ugly. For example, the lower stick can be absent. This
occurs when at least 25% of the scores have the lowest possible score.

Note: you can only put several variables in one boxplot graph when the
variables are measured on comparable scales. Otherwise, the boxplots of
variables with small scales are almost unreadable.

There are three types of boxplots:
1) 1-D Boxplot: a single boxplot of all scores for the chosen outcome
2) Simple boxplot: produces multiple boxplots for the chosen outcome by
splitting the data by a categorical variable. Useful for displaying different
boxplots on the same graphs for groups
3) Clustered boxplot: same as the simple boxplot, except that it splits data
by a second categorical variable.

Scatterplot
A graph that plots each person’s score on one variable against their score
on another. It visualises the relationship between the variables (you use it
to see if there is a correlation).
1) Simple scatter: to plot values of one continuous variable against
another. Just for looking at two variables. Although we can still talk about
predictors and outcomes, these terms do not imply causal relationships.
Here you can also ask SPSS for a regression line that summarizes the
relationship between variables in scatterplots.
2) Grouped scatter: like a simple scatter, except you can display points
belonging to different groups (a third categorical variable) in different
colours. You can also add a regression line for each group.
3) Simple dot plot (density plot): like a histogram, except that, rather than
having a summary bar representing the frequency of scores, individual
scores are displayed as dots. Like histograms, they are useful for looking
at the shape of the distribution of scores.
4) Scatterplot matrix: produces a grid of multiple scatterplots that are
showing the relationships between multiple pairs of variables in each cell
of the grid. Allows you to see the relationship between all combinations of
many different pairs of variables. Very convenient for examining pairs of
relationships between variables.

2. Compute statistics

* Skewness: measure of asymmetry of the distribution. Can be negatively
or positively skewed (see the description of skewness under histograms).

* Kurtosis: measure of peakedness of a distribution.
Perfectly normal = 0
Peak higher than normal > 0
Peak lower than normal < 0
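
As a rough illustration of what these two statistics measure, the sketch
below computes them from the central moments of a sample. The function
name skewness_and_kurtosis and the example data are made up, and this is
the simple moment-based version; as far as I know SPSS reports
bias-adjusted versions, so its numbers will differ somewhat in small
samples.

```python
def skewness_and_kurtosis(scores):
    """Moment-based skewness and excess kurtosis (0 for a perfectly normal shape)."""
    n = len(scores)
    mean = sum(scores) / n
    # Central moments: average of the deviations raised to the power 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3  # subtract 3 so that normal = 0
    return skewness, kurtosis


# Symmetric scores: skewness is 0 and the peak is lower than normal (kurtosis < 0).
print(skewness_and_kurtosis([1, 2, 3, 4, 5]))

# Many low scores and one high score: positive skewness.
print(skewness_and_kurtosis([1, 1, 1, 2, 10]))
```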

Difference standard error and standard deviation:
Standard error: measure of accuracy of statistic
Standard deviation: measure of spread in the sample

Hypothesis testing with z-value
To test the null hypothesis that skewness = 0 or kurtosis = 0 you use the z
value: z = skewness (or kurtosis) / standard error. You can find both
values in the SPSS descriptives table. The further the value is from 0,
the more likely it is that the data are not normally distributed.
Positive values of skewness indicate too many low scores in the
distribution, negative values indicate a build-up of high scores. For
kurtosis, positive values indicate a heavy-tailed distribution, whereas
negative values indicate a light-tailed distribution.
The critical z values are -1.96 and 1.96: an absolute z value greater
than 1.96 is significant at the p<.05 level, greater than 2.58 at the
p<.01 level and greater than 3.29 at the p<.001 level.
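
A minimal sketch of this z-value check, assuming you copy the statistic
and its standard error from the SPSS descriptives table by hand. The
function name normality_z and the example numbers are hypothetical.

```python
def normality_z(statistic, standard_error):
    """Divide skewness or kurtosis by its standard error and label the result."""
    z = statistic / standard_error
    abs_z = abs(z)  # the sign only tells the direction, not the significance
    if abs_z > 3.29:
        level = "p < .001"
    elif abs_z > 2.58:
        level = "p < .01"
    elif abs_z > 1.96:
        level = "p < .05"
    else:
        level = "not significant"
    return z, level


# Hypothetical values: skewness = 1.5 with standard error 0.5.
print(normality_z(1.5, 0.5))    # (3.0, 'p < .01')

# Hypothetical values: kurtosis = -0.5 with standard error 0.5.
print(normality_z(-0.5, 0.5))   # (-1.0, 'not significant')
```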

Hypothesis testing with Kolmogorov Smirnov test
This is a more direct normality test. It tests whether a distribution is
significantly different from normality by comparing the scores in the
sample to a normally distributed set of scores with the same mean and
standard deviation. If the KS test is highly significant (p < .01), we
have to conclude that the distribution is not normal. You use this test
with the Lilliefors correction in SPSS. So, you do not want significance
in the KS test: a non-significant result tells us that the distribution
of the sample is not significantly different from a normal distribution.
SPSS: Explore – plots – tick ‘normality plots with tests’.
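
To make the comparison behind the KS test concrete, the sketch below
computes only the test statistic D: the largest gap between the
cumulative distribution of the sample and that of a normal distribution
with the same mean and standard deviation. It does not compute a p-value
(the Lilliefors correction needs special tables or simulation), and the
function names and example data are made up for illustration.

```python
import math
import statistics


def normal_cdf(x, mean, sd):
    """Cumulative probability of x under a normal distribution."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


def ks_distance_from_normal(scores):
    """Largest gap between the sample's cumulative distribution and a normal one."""
    data = sorted(scores)
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1)
    d = 0.0
    for i, x in enumerate(data, start=1):
        expected = normal_cdf(x, mean, sd)
        # The sample's cumulative proportion jumps from (i - 1)/n to i/n at x,
        # so check the gap on both sides of the jump.
        d = max(d, i / n - expected, expected - (i - 1) / n)
    return d


# Evenly spread scores stay close to the normal curve (small D).
print(ks_distance_from_normal([1, 2, 3, 4, 5, 6, 7, 8, 9]))

# Many identical low scores plus one very high score: large D.
print(ks_distance_from_normal([1, 1, 1, 1, 1, 1, 1, 1, 50]))
```

A larger D means a bigger deviation from normality; whether it is
significant still has to come from the SPSS output.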

You can also use graphs to spot normality. For example, the P-P plot plots
the cumulative probability of a variable against the cumulative probability
of a particular distribution. The actual z-score is plotted against the
expected z-score. If the data are normally distributed, then the actual
z-score will be the same as the expected z-score, and you'll get a straight
diagonal line. So, if values fall on the diagonal of the plot then the
variable is normally distributed. When the data points are consistently
above or below the diagonal, this shows that the kurtosis differs from a
normal distribution, and when the data points are S-shaped, the problem is
skewness. The Q-Q plot is the same as the P-P plot, except it plots the
quantiles of the data instead of every individual score. It can be
interpreted exactly the same as a P-P plot.

As sample size gets larger, the assumption of normality matters less,
because the sampling distribution of the mean approaches normality
whatever the shape of the data (we call this the central limit theorem).
As a rule of thumb, this is the case when group sizes are larger than 15.
At the same time, in large samples a test of normality is more likely to
be significant, even for small deviations from normality. So, in larger
samples you should not do significance tests and instead look at the
shape of the distribution visually, interpret the value of the skewness
and kurtosis statistics, and possibly not even worry about normality at
all.

Hypothesis testing with Levene’s test
This test tells us whether the assumption of homogeneity of variances is
violated. If Levene's test is significant, the assumption is violated and
equal variances are thus not assumed. If group sizes are roughly equal
(Nmax / Nmin < 1.5), F tests are robust to this violation. For t-tests,
you always use Welch's t-test, which corresponds to the 'equal variances
not assumed' option (further explanation in t-test part).
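
The idea behind Levene's test can be sketched as follows: take each
score's absolute deviation from its own group mean and compare the group
averages of those deviations, as in a one-way ANOVA. The sketch below
returns only the test statistic W (no p-value) and includes the group
size check from the text. The function names and data are hypothetical,
and the mean-based version shown here is an assumption about which
variant is meant.

```python
def levene_w(groups):
    """Levene's W based on absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Absolute deviation of every score from its own group mean.
    deviations = []
    for g in groups:
        group_mean = sum(g) / len(g)
        deviations.append([abs(x - group_mean) for x in g])

    dev_means = [sum(z) / len(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total

    # Between-group and within-group sums of squares of the deviations.
    between = sum(len(z) * (zm - grand_mean) ** 2
                  for z, zm in zip(deviations, dev_means))
    within = sum((v - zm) ** 2
                 for z, zm in zip(deviations, dev_means) for v in z)
    return (n_total - k) / (k - 1) * between / within


def group_sizes_comparable(groups, max_ratio=1.5):
    """Rule of thumb from the text: Nmax / Nmin < 1.5."""
    sizes = [len(g) for g in groups]
    return max(sizes) / min(sizes) < max_ratio


# Same spread in both groups: W is 0.
print(levene_w([[1, 2, 3], [11, 12, 13]]))

# Much larger spread in the second group: W is clearly above 0.
print(levene_w([[1, 2, 3], [0, 10, 20]]))
```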

Lecture 1
2. Crosstabulation

Chi-square test
Investigate and test the relationship between two categorical (or nominal)
variables. You cannot use the mean or any similar statistic for categorical
variables. This test is based on the simple idea of comparing the
frequencies you observe in certain categories to the frequencies you might
expect to get in those categories by chance.
For example: to test the relationship between gender and kind of study
(alpha, beta or gamma).

Can be used for testing the following null hypothesis:
H0: the two nominal variables are completely independent from each
other.
Chi squared is based on a comparison between observed cell frequencies
and expected cell frequencies if H0 is true. For example: if O (observed) is
10 and 30, then E (expected) would be 20 and 20.
It is based on the discrepancy that occurs for each cell between what one
would expect under independence and what was actually observed.

What are degrees of freedom? The number of free observations. To
know if your chi squared value is significant, you have to look in a table.
Where you have to look in that table depends on the degrees of freedom
(you don't actually have to look in this table, since we don't do it by
hand; however, maybe you remember doing this in your bachelor). "There
exists a whole family of different chi squared distributions, dependent on
degrees of freedom".

In chi squared test, df = (I – 1) * (J – 1). In which I and J are the numbers of
categories for each variable. You can also interpret it as df = (r – 1) * (c –
1) in which r is the number of rows and c is the number of columns.
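
The chi-square calculation and the degrees of freedom formula above can
be sketched like this. The function name chi_square and the 2 x 2 table
are made up for illustration; the table extends the 10/30 example from
the text, so every expected frequency is 20. The sketch stops at the
chi-square value and df and does not look up the p-value.

```python
def chi_square(observed):
    """Chi-square statistic, df and expected frequencies for a crosstab."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected frequency under independence: row total * column total / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Sum over all cells of (observed - expected)^2 / expected.
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

    # df = (rows - 1) * (columns - 1)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Hypothetical 2 x 2 table: observed 10 and 30 in one row, 30 and 10 in the other.
chi2, df, expected = chi_square([[10, 30], [30, 10]])
print(chi2, df)    # 20.0 1
print(expected)    # [[20.0, 20.0], [20.0, 20.0]]
```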
