Week 2: Descriptive Statistics
Descriptive statistics vs. inferential statistics
~ Descriptive statistics:
- Statistics used to describe (sample) data without further conclusions
- Measures of central tendency: Mean, median, mode
- Measures of variation (or spread): range, IQR, variance, standard deviation
~ Inferential statistics:
- Describe data of sample in order to infer patterns in the population
- Statistical tests: t-test, χ²-test, etc.
~ Sample vs. population
- Studying the whole population is (almost always) practically impossible
- A sample is a (selected) subset of the population and thus more accessible
- Selecting a representative sample is very important
(Types of) variables
~ Tabular representation of data:
- Each case is shown in a row
- Each variable is in a column
~ Nominal (categorical) scale: unordered categories
- Gender (frequently binary: two categories), native language, etc.
~ Ordinal: ordered (ranked) scale, but amount of difference unclear
- Rank of English proficiency (in class), Likert scale ("Rate on a scale from 1 to 5...")
~ Interval scale: numerical with meaningful difference but no true 0
- Year of birth, temperature in Celsius
~ Ratio scale: numerical with meaningful difference and true 0
- Number of questions correct, age
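The four measurement scales can be sketched in R as follows. The data frame dat_example and all of its values are made up for illustration and are not part of the course data set; the sketch only shows which operations are meaningful at each scale.

```r
# Hypothetical example data (not the course data set)
dat_example <- data.frame(
  native_language = factor(c("Dutch", "German", "Dutch", "English")),  # nominal
  proficiency = factor(c("low", "high", "medium", "high"),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE),                                 # ordinal
  birth_year = c(1999, 2001, 2000, 1998),                               # interval
  n_correct = c(12, 18, 15, 20)                                         # ratio
)

# Nominal: only counting how often each category occurs is meaningful
table(dat_example$native_language)

# Ordinal: order comparisons are meaningful (here: low < high)
dat_example$proficiency[1] < dat_example$proficiency[2]

# Interval: differences are meaningful (2 years apart), ratios are not
dat_example$birth_year[2] - dat_example$birth_year[1]

# Ratio: ratios are meaningful because 0 means "none correct"
dat_example$n_correct[4] / dat_example$n_correct[1]
```

Storing nominal variables as factor and ordinal variables as ordered factor lets R refuse operations that make no sense at that scale (for example, taking the mean of a factor).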
Distribution of a variable
~ Normal distribution
- Has convenient characteristics
- Completely symmetric
- Red area: (about) 80%
- Red and green area: (about) 95%
~ Frequency of values (distribution of variables shows variability)
- table(dat$english_grade)
~ Histogram (shows frequency of all values in groups)
- hist(dat$english_grade, xlab = "English grade", main = "")
~ Density curve (shows area proportional to the relative frequency)
- plot(density(dat$english_grade), main = "", xlab = "English grade")
- The total area under a density curve is equal to 1
- A density curve does not provide information about the frequency of a
single exact value
o E.g., there might be no one who has scored a grade of exactly 6.1
- It only provides information about an interval
o E.g., more than 50% of the grades lie between 5.5 and 7.5
~ A distribution can also be characterized by measures of center and variation
- (skewness measures the symmetry of the distribution; not covered in this
course)
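The statements about areas can be checked numerically. The following is a minimal sketch: the helper area_within, the cut-offs 1, 1.28 and 2, and the vector grades are illustrative assumptions and are not taken from the original figure or data.

```r
# Area under the standard normal curve within k standard deviations of the mean
area_within <- function(k) pnorm(k) - pnorm(-k)
area_within(1)      # about 0.68
area_within(1.28)   # about 0.80
area_within(2)      # about 0.95

# Hypothetical grades (not the course data set)
grades <- c(5.5, 6, 6, 6.5, 7, 7, 7, 7.5, 8, 9)

# No one scored exactly 6.1, so a single value tells us little
n_exact <- sum(grades == 6.1)

# An interval is informative: proportion of grades between 5.5 and 7.5
prop_interval <- mean(grades >= 5.5 & grades <= 7.5)

# The area under the estimated density curve is (approximately) 1;
# computed here with the trapezoid rule over the grid returned by density()
d <- density(grades)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
```

With these made-up grades, 8 of the 10 values fall in the interval, so prop_interval is 0.8.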
Measures of central tendency
~ Mode
- Most frequent value (the only meaningful measure of central tendency for
nominal data)
- my_mode <- function(x) {
    counts <- table(x)                    # frequency of each distinct value
    names(which(counts == max(counts)))   # value(s) with the highest count
  }
  my_mode(dat$english_grade)
~ Median
- The middle value when the data are sorted from small to large
- median(dat$english_grade)
~ Mean
- Arithmetic average: the sum of all values divided by the number of values
- mean(dat$english_grade)
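As a small worked example, the three measures can be computed on a short vector. The vector grades is made up for illustration; my_mode repeats the definition given above so that the sketch runs on its own.

```r
# Hypothetical grades (not the course data set)
grades <- c(6, 7, 6, 8, 6, 9, 7)

my_mode <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  names(which(counts == max(counts)))   # value(s) with the highest count
}

my_mode(grades)   # "6": returned as text, because table() labels are strings
median(grades)    # sorted: 6 6 6 7 7 8 9, so the middle value is 7
mean(grades)      # 49 / 7 = 7

# With an even number of values, median() averages the two middle values
median(c(6, 7, 8, 9))   # 7.5
```

If several values share the highest count, my_mode returns all of them.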
Measures of variation
~ Quantiles: cutpoints that divide the sorted data into subsets of equal size
- Quartiles: three cutpoints that divide the data into four equal-sized sets
o q1 (1st quartile): cutpoint between 1st and 2nd group
o q2 (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
o q3 (3rd quartile): cutpoint between 3rd and 4th group
- Percentiles: divide the data into one hundred equal-sized subsets
o q1 = 25th percentile
o q2 (= median) = 50th percentile
o q3 = 75th percentile
o A score at the nth percentile is higher than (about) n% of the scores
- quantile(dat$english_grade)
~ Minimum, maximum: lowest and highest value
- min(dat$english_grade); max(dat$english_grade)
~ Range: difference between minimum and maximum
- range(dat$english_grade)         # returns the minimum and the maximum
- diff(range(dat$english_grade))   # returns the difference between them
~ Interquartile range (IQR): q3 - q1
- IQR(dat$english_grade)
~ Box plot: visualizes the variation of a variable
- Box (IQR): q1 (bottom), median (thickest line), q3 (top)
o (In example below, q1 and median have the same value)
- Whiskers: maximum (top) and minimum (bottom) non-outlier value
- Circle(s): outliers (more than 1.5 × IQR away from the box)
- boxplot(dat$english_grade, col = "red")
~ Deviation: difference between mean and individual value
~ Variance: average squared deviation
- Squared in order to make negative differences positive
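- Population variance (with μ = population mean, N = population size):
- $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
- As the sample mean (x̄) is only an approximation of the population mean (μ),
the sample variance formula divides by n − 1 instead of n (which results in a
slightly higher variance):
- $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$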
- var(dat$english_grade)
~ Standard deviation: square root of variance
- $\sigma = \sqrt{\sigma^2}$ (population), $s = \sqrt{s^2}$ (sample)
- sd(dat$english_grade)
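The measures above can be tied together in one sketch. The vector grades is made up for illustration, and the outlier check is a simplified version of the 1.5 × IQR rule based on quantile(); R's boxplot() computes its box from hinges, which can differ slightly from these quartiles.

```r
# Hypothetical grades (not the course data set)
grades <- c(2, 6, 6, 7, 7, 7, 8, 8, 9, 10)

# Quartiles and interquartile range
q <- quantile(grades)                  # 0%, 25%, 50%, 75%, 100%
q1 <- unname(q["25%"])
q3 <- unname(q["75%"])
iqr <- IQR(grades)                     # same as q3 - q1

# Values more than 1.5 * IQR below q1 or above q3 count as outliers
outliers <- grades[grades < q1 - 1.5 * iqr | grades > q3 + 1.5 * iqr]

# Range as a single number
value_range <- diff(range(grades))

# Variance by hand: average squared deviation from the mean
n <- length(grades)
deviations <- grades - mean(grades)
pop_var <- sum(deviations^2) / n            # division by n
sample_var <- sum(deviations^2) / (n - 1)   # division by n - 1, as in var()

# Standard deviation: square root of the variance
sample_sd <- sqrt(sample_var)

all.equal(sample_var, var(grades))   # TRUE
all.equal(sample_sd, sd(grades))     # TRUE
```

Note that var() and sd() always use the n − 1 version, so the result is slightly larger than the population formula applied to the same numbers.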
Standardized scores
~ Standardization facilitates interpretation
- E.g., how to interpret: "Emma's score is 112" and "Tom's score is 105"
~ Interpretation should be done with respect to mean μ and standard deviation σ
- Raw scores can be transformed to standardized scores (z-scores or z-values)
- $z = \frac{x - \mu}{\sigma}$
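To illustrate with the scores of Emma and Tom: the raw scores 112 and 105 come from the example above, but the means and standard deviations below are invented for illustration (assuming the two took different tests), as is the helper z_score.

```r
# z-score: distance from the mean, expressed in standard deviations
z_score <- function(x, mu, sigma) (x - mu) / sigma

# Hypothetical test characteristics (assumptions, not from the course)
emma_z <- z_score(112, mu = 100, sigma = 15)   # (112 - 100) / 15 = 0.8
tom_z  <- z_score(105, mu = 90,  sigma = 10)   # (105 - 90) / 10 = 1.5

# Standardizing a whole (hypothetical) sample with its own mean and sd
scores <- c(95, 100, 105, 112, 88)
z <- (scores - mean(scores)) / sd(scores)
mean(z)   # (approximately) 0
sd(z)     # 1
```

Under these assumed values, Tom's lower raw score is the better one relative to his group: 1.5 standard deviations above the mean versus 0.8 for Emma.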