Methods & Statistics 1: Descriptive statistics, correlation, χ² test of independence
Descriptive statistics
Descriptive statistics and related plots are a succinct way of describing and summarising data but do
not test any hypotheses. There are various types of statistics that are used to describe data:
Measures of central tendency
Measures of dispersion
Percentile values
Measures of distribution
Descriptive plots
Central tendency can be defined as the tendency for variable values to cluster around a central
value. The three ways of describing this central value are mean, median or mode.
Mean, M or x̅, is equal to the sum of all the values divided by the number of values in the
dataset, i.e. the average of the values. It is used for describing continuous data. However, it
can be heavily influenced by ‘extreme’ scores (outliers).
Median, Mdn, is the middle value in a dataset that has been ordered from the smallest to the
largest value, and is the usual measure for ordinal or non-parametric continuous data.
It is less sensitive to outliers and skewed data than the mean.
Mode is the most frequent value in the dataset and is usually the highest bar in a distribution
histogram.
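The three measures can be compared on a small example. This is a minimal sketch using Python's standard `statistics` module; the `scores` list is made-up data, chosen so that one 'extreme' score pulls the mean away from the median and mode.

```python
import statistics

# Hypothetical scores; the single large value (40) acts as an 'extreme' score.
scores = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(scores)      # sum of the values / number of values
median = statistics.median(scores)  # middle value of the ordered data
mode = statistics.mode(scores)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
```

Here the mean is 9 while the median is 4 and the mode is 3, which illustrates why the median is preferred when outliers or skew are present.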
Dispersion
Standard deviation, S or SD is used to quantify the amount of dispersion of data values
around the mean. A low standard deviation indicates that the values are close to the mean,
while a high standard deviation indicates that the values are dispersed over a wider range.
Variance is another estimate of how far the data is spread from the mean. It is also the
square of the standard deviation.
The standard error of the mean, SE, is a measure of how far the sample mean of the data is
expected to be from the true population mean. It is calculated as SE = S / √n, so as the
sample size grows the SE shrinks relative to S and the true mean of the population is
estimated with greater precision.
MAD, median absolute deviation, is a robust measure of the spread of data. It is relatively
unaffected by data that is not normally distributed. Reporting median +/- MAD for data that
is not normally distributed is analogous to reporting mean +/- SD for normally distributed data.
IQR, interquartile range, is similar to the MAD but is less robust. It is the difference
between the 75th and 25th percentile values.
Confidence intervals (CI) are a range of values within which you are n% confident the true
mean is included. A 95% CI is, therefore, a range of values that one can be 95% certain
contains the true mean of the population. This is not the same as a range that contains 95%
of ALL the values.
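The dispersion measures above can be sketched in a few lines of Python. The `data` list is hypothetical, and several choices here are assumptions rather than anything the notes specify: the SD and variance use the sample (n - 1) formulas, the MAD is left unscaled, the quartiles use the default method of `statistics.quantiles` (other software may interpolate differently), and the 95% CI uses the normal approximation (z of about 1.96) where a t critical value would normally be used for a sample this small.

```python
import math
import statistics
from statistics import NormalDist

data = [4.1, 5.0, 5.3, 5.9, 6.2, 6.8, 7.4, 8.3]  # hypothetical sample

n = len(data)
mean = statistics.mean(data)
sd = statistics.stdev(data)            # sample standard deviation (n - 1)
variance = statistics.variance(data)   # sample variance, equal to sd squared
se = sd / math.sqrt(n)                 # standard error of the mean

median = statistics.median(data)
# Median absolute deviation: median of the absolute distances to the median.
mad = statistics.median([abs(x - median) for x in data])

# Interquartile range: 75th percentile minus 25th percentile.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Approximate 95% confidence interval for the mean (normal approximation).
z = NormalDist().inv_cdf(0.975)
ci = (mean - z * se, mean + z * se)

print(f"SD={sd:.3f} variance={variance:.3f} SE={se:.3f}")
print(f"MAD={mad:.3f} IQR={iqr:.3f}")
print(f"95% CI for the mean: {ci[0]:.3f} to {ci[1]:.3f}")
```

Note that the CI is built from the SE, not the SD, which is why it describes the likely location of the mean rather than the range of individual values.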
Quartiles split a rank-ordered dataset into 4 equal quarters. The cut points are the 25th,
50th (the median) and 75th percentiles.
Distribution
Skewness describes the asymmetry of the distribution relative to a normal distribution. Negative
skewness shows that the mode moves to the right, resulting in a dominant left tail. Positive
skewness shows that the mode moves to the left, resulting in a dominant right tail.
Kurtosis describes how heavy or light the tails are. Positive kurtosis results in an increase in the
“pointiness” of the distribution with heavy (longer) tails, while negative kurtosis results in a much
more uniform or flatter distribution with light (shorter) tails.
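The sign conventions for skewness and kurtosis can be checked numerically. The sketch below uses the plain moment-based (population) formulas, with kurtosis reported as excess kurtosis so that a normal distribution scores 0. This is an assumption: statistical packages often apply small-sample corrections, so their values may differ slightly. All datasets are made up.

```python
def central_moment(values, k):
    """k-th central moment: mean of (x - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 9]      # long right tail
left_tailed = [-x for x in right_tailed]        # mirror image: long left tail
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]          # uniform-like, light tails
heavy = [-10, -1, 0, 0, 0, 0, 0, 1, 10]         # peaked with heavy tails

print(skewness(right_tailed) > 0)   # positive skew
print(skewness(left_tailed) < 0)    # negative skew
print(excess_kurtosis(flat) < 0)    # negative kurtosis
print(excess_kurtosis(heavy) > 0)   # positive kurtosis
```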
Currently, JASP produces four main types of descriptive plots:
Distribution plots are based on splitting the data into frequency bins; this is then overlaid
with the distribution curve
Correlation plot
Boxplots (visualize a number of statistics) – with 3 options
o Boxplot Element
o Violin Element
o Jitter Element
Q-Q plots (quantile-quantile plot) can be used to visually assess if a set of data comes from a
normal distribution
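To show what a Q-Q plot actually compares, the points behind it can be computed by hand. This is a rough sketch, not how JASP draws its plot: the function name `qq_points` is hypothetical, the plotting positions (i - 0.5) / n are one common convention among several, and the reference is a normal distribution fitted with the sample mean and SD.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each ordered observation with the matching normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n always lie strictly between 0 and 1.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

sample = [4.8, 5.1, 5.5, 5.9, 6.0, 6.3, 6.6, 7.2]  # hypothetical data
for expected, observed in qq_points(sample):
    print(f"{expected:6.2f}  {observed:6.2f}")
```

If the data come from a normal distribution, the pairs fall close to the line where the theoretical quantile equals the observed value; systematic curvature suggests skew or heavy tails.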
Correlation analyses
Correlation is a statistical technique that can be used to determine if, and how strongly, pairs of
variables are associated. Correlation is only appropriate for quantifiable data in which numbers are
meaningful, such as continuous or ordinal data.
Standardized covariance is used: Pearson’s correlation coefficient (or "r"). It ranges from -1.0 to
+1.0. The closer r is to +1 or -1, the more closely the two variables are related. If r is close to 0,
there is little or no linear relationship. If r is (+) then as one variable increases the other also
increases. If r is (-) then as one increases, the other decreases (sometimes referred to as an
"inverse" correlation).
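The phrase "standardized covariance" can be made concrete: r is the covariance of the two variables divided by the product of their standard deviations. The following is a minimal sketch with made-up data; the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson's r = cov(x, y) / (sd_x * sd_y), using n - 1 denominators."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5, 6]            # hypothetical study hours
scores = [52, 55, 61, 60, 68, 71]     # hypothetical exam scores

print(round(pearson_r(hours, scores), 3))             # positive relationship
print(round(pearson_r(hours, [2 * h for h in hours]), 3))   # perfect positive
print(round(pearson_r(hours, [10 - h for h in hours]), 3))  # perfect negative
```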
Main assumptions: the data are normally distributed and the relationship between the two variables is linear
If you take the correlation coefficient r and square it you get the coefficient of determination (R²).
This is a statistical measure of the proportion of variance in one variable that is explained by the
other variable. Or:
R² = Explained variation / Total variation
R² is always between 0 and 100% where:
0% indicates that the model explains none of the variability of the response data around its
mean and 100% indicates that the model explains all the variability of the response data
around its mean.
Effect size of r (absolute value): 0 < negligible < 0.1 < small < 0.3 < moderate < 0.5 < large
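The thresholds and the squaring step can be combined in a small helper. This is an illustrative sketch: the function name `effect_size_label` is hypothetical, the value of `r` is made up, and treating a value exactly on a boundary (e.g. 0.1) as belonging to the higher category is an assumption, since the notes do not say which side the boundaries fall on.

```python
def effect_size_label(r):
    """Classify the size of a correlation using the thresholds 0.1, 0.3, 0.5."""
    size = abs(r)  # the sign gives the direction, not the strength
    if size < 0.1:
        return "negligible"
    if size < 0.3:
        return "small"
    if size < 0.5:
        return "moderate"
    return "large"

r = 0.42                # hypothetical correlation coefficient
r_squared = r ** 2      # coefficient of determination

print(f"r = {r}: {effect_size_label(r)} effect, "
      f"R^2 = {r_squared:.1%} of variance explained")
```

A moderate correlation of 0.42 therefore explains only about 17.6% of the variance, which is a useful reminder that r and R² are on different scales.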
Running non-parametric correlation – Spearman’s and Kendall’s tau
If your data is ordinal, or is continuous data that has violated the assumptions required for
parametric testing (normality and/or variance), you need to use the non-parametric alternatives to
Pearson’s correlation coefficient. The alternatives are Spearman’s (rho) or Kendall’s (tau)
correlation coefficients.
Both are based on ranking data and are much less affected by outliers or normality/variance
violations.
Spearman's rho is usually used for ordinal scale data and Kendall's tau is used in small
samples or when many values have the same score (ties).
In most cases, Kendall’s tau and Spearman’s rank correlation coefficients are very similar and
thus usually lead to the same inferences.
The effect sizes are the same as Pearson’s r. The main difference is that rho² can be used as
an approximate non-parametric coefficient of determination but the same is not true for
Kendall’s tau.
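Both rank-based coefficients can be sketched directly from their definitions. In this sketch (all names and data are illustrative) Spearman's rho is Pearson's r applied to the ranks, with tied values sharing their average rank, and Kendall's tau is computed in its tau-b form, which adjusts for ties. Whether a given package reports tau-b is an assumption to check in its documentation.

```python
import math

def average_ranks(values):
    """Rank from smallest (1) upwards; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant and discordant pairs."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                  # tied on both variables
            elif dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1           # pair ordered the same way
            else:
                discordant += 1           # pair ordered in opposite ways
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom

x = [1, 2, 3, 4, 5]   # hypothetical ordinal ratings
y = [2, 1, 4, 3, 5]

print(round(spearman_rho(x, y), 3), round(kendall_tau_b(x, y), 3))
```

For this example rho is 0.8 and tau is 0.6. Tau is typically smaller in magnitude than rho for the same data, even though both usually support the same conclusion.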
Chi-square test for association
The chi-square (χ²) test for independence can be used to determine whether a relationship
exists between two or more categorical variables. The test produces a contingency table, or
cross-tabulation, which displays the cross-grouping of the categorical variables.
The χ² test checks the null hypothesis that there is no association between two categorical variables.
It compares the observed frequencies of the data with the frequencies that would be expected if there
were no association between the two variables.
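The comparison of observed and expected frequencies can be worked through on a small contingency table. In this sketch the counts are made up and the function names are illustrative. Under the null hypothesis each expected count is (row total × column total) / grand total, and the test statistic sums (observed - expected)² / expected over all cells.

```python
def expected_counts(observed):
    """Expected cell counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Return the chi-square statistic and its degrees of freedom."""
    expected = expected_counts(observed)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Hypothetical 2 x 2 contingency table (rows: group A / B, columns: yes / no).
observed = [[30, 10],
            [20, 40]]

statistic, df = chi_square(observed)
print(expected_counts(observed))
print(f"chi-square = {statistic:.2f}, df = {df}")
```

With 1 degree of freedom the 5% critical value is 3.84, so a statistic of about 16.67 would lead to rejecting the null hypothesis of no association for these made-up counts.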
The chi-square test has two assumptions:
The two variables must be categorical data (nominal or ordinal)
Each variable should comprise two or more independent categorical groups
Validity: χ² tests are only valid when you have a reasonable sample size, that is, fewer than 20% of
cells have an expected count of less than 5 and none have an expected count of less than 1.
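The validity rule can be checked mechanically before running the test. This is a sketch under the rule exactly as stated above; the function names and both example tables are hypothetical.

```python
def expected_counts(observed):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_is_valid(observed):
    """True if fewer than 20% of expected counts are below 5 and none below 1."""
    cells = [e for row in expected_counts(observed) for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 < 0.20 and min(cells) >= 1

large_sample = [[30, 10], [20, 40]]   # all expected counts are 20 or more
small_sample = [[3, 2], [1, 4]]       # all expected counts are below 5

print(chi_square_is_valid(large_sample))
print(chi_square_is_valid(small_sample))
```

Note that the rule applies to the expected counts, not the observed ones, so a table can contain a small observed cell and still pass.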