Week 2: Descriptive Statistics
Descriptive statistics vs. inferential statistics
~ Descriptive statistics:
- Statistics used to describe (sample) data without further conclusions
- Measures of central tendency: Mean, median, mode
- Measures of variation (or spread): range, IQR, variance, standard deviation
~ Inferential statistics:
- Describe data of sample in order to infer patterns in the population
- Statistical tests: t-test, χ2-test, etc.
~ Sample vs. population
- Studying the whole population is (almost always) practically impossible
- A sample is a (selected) subset of the population and thus more accessible
- Selecting a representative sample is very important
(Types of) variables
~ Tabular representation of data:
- Each case is shown in a row
- Each variable is in a column
~ Nominal (categorical) scale: unordered categories
- Gender (frequently binary: two categories), Native language, etc.
~ Ordinal: ordered (ranked) scale, but amount of difference unclear
- Rank of English proficiency (in class), Likert scale (Rate on a scale from 1 to 5...)
~ Interval scale: numerical with meaningful difference but no true 0
- Year of birth, temperature in Celsius
~ Ratio scale: numerical with meaningful difference and true 0
- Number of questions correct, age
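As a rough illustration of the tabular layout and the four scales, the sketch below builds a small data frame in R. The data frame `dat_example`, its column names, and all values are hypothetical and only serve as an example; they are not the course data set.

```r
# Hypothetical toy data: one row per case, one column per variable
dat_example <- data.frame(
  # Nominal: unordered categories
  native_language = factor(c("Dutch", "German", "Dutch", "English")),
  # Ordinal: ordered categories, distance between levels unclear
  proficiency = factor(c("low", "high", "mid", "mid"),
                       levels = c("low", "mid", "high"), ordered = TRUE),
  # Interval: differences are meaningful, but there is no true 0
  birth_year = c(1999, 2001, 2000, 1998),
  # Ratio: differences are meaningful and 0 means "none"
  n_correct = c(12, 18, 15, 15)
)

str(dat_example)

# Ordered factors can be compared; this is not meaningful for nominal factors
dat_example$proficiency < "high"   # TRUE FALSE TRUE TRUE
```

Storing nominal variables as `factor` and ordinal variables as ordered factors is one common convention in R; other encodings are possible.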
Distribution of a variable
~ Normal distribution
- (Figure: normal distribution curve with red and green shaded areas)
- Has convenient characteristics
- Completely symmetric
- Red area: (about) 80%
- Red and green area: (about) 95%
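The figure itself is not reproduced in these notes, so the exact boundaries of the shaded areas are unknown. As a sketch, areas under a normal curve can be computed in R with `pnorm()` and `qnorm()`; the helper `area_within()` is a hypothetical name introduced here.

```r
# Proportion of a normal distribution lying within k standard deviations
# of the mean (hypothetical helper, not part of the course material)
area_within <- function(k) pnorm(k) - pnorm(-k)

round(area_within(1), 3)   # 0.683
round(area_within(2), 3)   # 0.954

# Half-width (in standard deviations) of the central interval that
# contains 80% or 95% of the distribution
round(qnorm(0.90), 2)      # 1.28
round(qnorm(0.975), 2)     # 1.96
```

So a central area of about 95% corresponds to roughly 2 standard deviations on either side of the mean, and a central area of about 80% to roughly 1.28 standard deviations.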
~ Frequency of values (distribution of variables shows variability)
- table(dat$english_grade)
~ Histogram (shows frequency of all values in groups)
- hist(dat$english_grade, xlab = "English grade", main = "")
~ Density curve (shows area proportional to the relative frequency)
- plot(density(dat$english_grade), main = "", xlab = "English grade")
- The total area under a density curve is equal to 1
- A density curve does not provide information about the frequency of one
value
o E.g., there might be no one who has scored a grade of exactly 6.1
- It only provides information about an interval
o E.g., more than 50% of the grades lie between 5.5 and 7.5
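The difference between the frequency of a single value and the proportion in an interval can be checked on a small example. The vector `grades` below is hypothetical and stands in for `dat$english_grade`.

```r
# Hypothetical grades (not the course data)
grades <- c(5.0, 5.5, 6.0, 6.5, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0)

# Number of cases with a grade of exactly 6.1
sum(grades == 6.1)                     # 0

# Proportion of grades in the interval from 5.5 to 7.5
mean(grades >= 5.5 & grades <= 7.5)    # 0.6

# Approximate total area under the estimated density curve:
# sum of the heights times the (constant) grid spacing, close to 1
d <- density(grades)
area <- sum(d$y) * (d$x[2] - d$x[1])
area
```

The area is computed with a simple rectangle approximation, so it will be close to 1 rather than exactly 1.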
~ A distribution can also be characterized by measures of center and variation
- (skewness measures the symmetry of the distribution; not covered in this
course)
Measures of central tendency
~ Mode
- most frequent element (for nominal data: only meaningful measure)
- my_mode <- function(x) {
    counts <- table(x)                   # frequency of each distinct value
    names(which(counts == max(counts)))  # value(s) with the highest frequency (as text)
  }
  my_mode(dat$english_grade)
~ Median
- when data is sorted from small to large, it is the middle value
- median(dat$english_grade)
~ Mean
- arithmetic average (sum of all values divided by the number of values)
- mean(dat$english_grade)
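To see how the three measures are obtained, the sketch below computes each of them step by step on a small hypothetical vector and compares the results with the built-in functions.

```r
# Hypothetical grades (not the course data), odd number of values
grades <- c(6, 7, 7, 8, 9)

# Mode: the value with the highest frequency
counts <- table(grades)
mode_value <- names(counts)[counts == max(counts)]
mode_value                                   # "7"

# Median: middle value of the sorted data (position (n + 1) / 2 for odd n)
median_by_hand <- sort(grades)[(length(grades) + 1) / 2]
median_by_hand                               # 7

# Mean: sum of the values divided by the number of values
mean_by_hand <- sum(grades) / length(grades)
mean_by_hand                                 # 7.4
```

For an even number of values, `median()` returns the average of the two middle values; the indexing shortcut above only applies to an odd number of values.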
Measures of variation
~ Quantiles: cutpoints that divide the sorted data into subsets of equal size
- Quartiles: three cutpoints that divide the data into four equal-sized sets
o q1 (1st quartile): cutpoint between 1st and 2nd group
o q2 (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
o q3 (3rd quartile): cutpoint between 3rd and 4th group
- Percentiles: divide the data into one hundred equal-sized subsets
o q1 = 25th percentile
o q2 (= median) = 50th percentile
o Score at nth percentile is better than n% of scores
- quantile(dat$english_grade)
~ Minimum, maximum: lowest and highest value
- min(dat$english_grade); max(dat$english_grade)
~ Range: difference between minimum and maximum
- range(dat$english_grade) # returns the minimum and the maximum
- diff(range(dat$english_grade)) # maximum minus minimum
~ Interquartile range (IQR): q3 - q1
- IQR(dat$english_grade)
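The quartile-based measures can be tried out on a small example. The vector `grades` below is hypothetical; the results depend on R's default quantile method (type 7), and other methods can give slightly different cutpoints.

```r
# Hypothetical grades (not the course data), already sorted
grades <- c(4, 5, 6, 6, 7, 7, 8, 9, 10)

q <- quantile(grades)        # minimum, q1, median, q3, maximum
q
#   0%  25%  50%  75% 100%
#    4    6    7    8   10

range(grades)                # 4 10 (minimum and maximum)
diff(range(grades))          # 6    (range as a single number)

# IQR is q3 - q1
iqr_by_hand <- unname(q["75%"] - q["25%"])
iqr_by_hand                  # 2
IQR(grades)                  # 2
```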
~ Box plot: used to visualize the variation of a variable
- Box (IQR): q1 (bottom), median (thickest line), q3 (top)
o (In example below, q1 and median have the same value)
- Whiskers: maximum (top) and minimum (bottom) non-outlier value
- Circle(s): outliers (> 1.5 IQR distance from box)
- boxplot(dat$english_grade, col = "red")
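The numbers behind a box plot can be inspected with `boxplot.stats()`, which `boxplot()` uses internally. The sketch below uses a hypothetical vector with one extreme value. Note that the box edges are computed as "hinges", which can differ slightly from the q1 and q3 returned by `quantile()`.

```r
# Hypothetical grades (not the course data) with one extreme value
grades <- c(4, 5, 6, 6, 7, 7, 8, 9, 10, 15)

b <- boxplot.stats(grades)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
b$stats                      # 4 6 7 9 10

# Values further than 1.5 times the box height away from the box
b$out                        # 15

# The same rule by hand: upper fence = upper hinge + 1.5 * box height
upper_fence <- b$stats[4] + 1.5 * (b$stats[4] - b$stats[2])
upper_fence                  # 13.5
```

Because 15 lies above the upper fence of 13.5, it is drawn as a circle, and the upper whisker ends at 10, the highest non-outlier value.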
~ Deviation: difference between mean and individual value
~ Variance: average squared deviation
- Squared in order to make negative differences positive
- Population variance (with μ = population mean and N = population size):
- $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
- As the sample mean (x̄) is only an approximation of the population mean (μ), the sample variance formula divides by n−1 instead of n (which results in a slightly higher variance):
- $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- var(dat$english_grade)
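~ Standard deviation: square root of variance
- $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$ (for the population: $\sigma = \sqrt{\sigma^2}$)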
- sd(dat$english_grade)
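The step from deviations to variance and standard deviation can be followed on a small hypothetical vector. The sketch also shows that `var()` and `sd()` use the sample formula (division by n−1).

```r
# Hypothetical values (not the course data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

deviations <- x - mean(x)          # mean(x) is 5
sum_sq <- sum(deviations^2)        # 32

# Dividing by n treats the data as a complete population
pop_var <- sum_sq / n              # 4

# Dividing by n - 1 gives the sample variance, as computed by var()
sample_var <- sum_sq / (n - 1)     # 32 / 7, about 4.57
var(x)                             # same value as sample_var

# Standard deviation: square root of the (sample) variance
sample_sd <- sqrt(sample_var)      # about 2.14
sd(x)                              # same value as sample_sd
```

The sample variance (about 4.57) is slightly higher than the value obtained by dividing by n (4), as described above.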
Standardized scores
~ Standardization facilitates interpretation
- E.g., how to interpret: "Emma's score is 112" and "Tom's score is 105"
~ Interpretation should be done with respect to mean μ and standard deviation σ
- Raw scores can be transformed to standardized scores (z-scores or z-values)
- $z = \frac{x - \mu}{\sigma}$
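A minimal sketch of the transformation in R follows. The notes do not give the means and standard deviations for Emma's and Tom's tests, so the values below (and the helper name `z_score`) are purely hypothetical assumptions for illustration.

```r
# z-score: distance from the mean, expressed in standard deviations
# (hypothetical helper, not part of the course material)
z_score <- function(x, mu, sigma) (x - mu) / sigma

# Assumed (hypothetical) test characteristics:
# Emma's test: mean 100, standard deviation 15
# Tom's test:  mean 90,  standard deviation 10
z_emma <- z_score(112, mu = 100, sigma = 15)
z_tom  <- z_score(105, mu = 90,  sigma = 10)
z_emma   # 0.8
z_tom    # 1.5

# Standardizing a whole (hypothetical) sample with scale(), which uses
# the sample mean and the sample standard deviation
scores <- c(95, 100, 105, 112, 120)
z <- as.numeric(scale(scores))
mean(z)  # (approximately) 0
sd(z)    # 1
```

Under these assumed values, Tom's raw score is lower than Emma's, but his standardized score is higher, because his score lies further above the mean of his own test.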