Chapter 1: Collecting data
- Statistics is the science of collecting, describing and analyzing data
- Categorical variable: divides the cases into groups, placing each case into exactly one of two or
more categories
-> gender is categorical even if we choose to record the results as 1 for male and 2 for female
- Quantitative variable: measures or records a numerical quantity for each case. Numerical
operations like adding and averaging make sense for quantitative variables
- If we are using one variable to help us understand or predict values of another variable, we call
the former the explanatory variable and the latter the response variable
- Sampling bias occurs when the method of selecting a sample causes the sample to differ from the
population in some relevant way
- Random sampling caution: In statistics, random is NOT the same as haphazard
- Bias exists when the method of collecting data systematically causes the sample data to
inaccurately reflect the population
- Bias can occur when people we have selected to be in our sample choose not to participate.
If the people who choose to respond would answer differently than the people who choose
not to respond, results will be biased
- The way questions are worded can also bias the results
- Association: Two variables are associated if values of one variable tend to be related to the values
of the other variable
- Causation: Two variables are causally associated if changing the value of one variable influences the
value of the other variable
- A confounding variable, also known as a confounding factor or lurking variable, is a third variable
that is associated with both the explanatory and the response variable. A confounding variable can
offer a plausible explanation for an association between two variables of interest
- An experiment is a study in which the researcher actively controls one or more of the explanatory
variables
- An observational study is a study in which the researcher does not actively control the value of any
variable but simply observes the values as they naturally exist
- In a randomized experiment the value of the explanatory variable for each unit is determined
randomly, before the response variable is measured. If a randomized experiment yields an
association between two variables, we can establish a causal relationship from the explanatory to the
response variable
- Two types of randomized experiments:
- In a randomized comparative experiment, we randomly assign cases to different treatment
groups and then compare results on the response variable(s)
-> the cases that receive a placebo instead of the treatment are called the control group
- In a matched pairs experiment, each case gets both treatments in random order (or cases
get paired up in some other obvious way), and we examine individual differences in the
response variable between the two treatments
- Using a placebo is not helpful, however, if participants know they are not getting the real treatment.
This is one of the reasons that blinding is so important:
- Single-blind experiment: participants are not told which group they are in
- Double-blind experiment: neither the participants nor the people interacting with them are told
which group the participants are in
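The random assignment step of a randomized comparative experiment can be sketched in a few lines of Python. This is only an illustration: the function name `assign_groups`, the case labels, and the even split into two groups are assumptions made for the example, not something the notes prescribe.

```python
import random


def assign_groups(cases, seed=None):
    """Randomly split cases into a treatment group and a control group.

    Hypothetical helper: chance alone (not the researcher or the
    participant) decides who ends up in which group.
    """
    rng = random.Random(seed)   # a seed makes the assignment reproducible
    shuffled = list(cases)      # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Made-up case labels, for illustration only
participants = [f"case_{i}" for i in range(1, 11)]
groups = assign_groups(participants, seed=1)

print("treatment:", groups["treatment"])
print("control:  ", groups["control"])
```

Because the shuffle decides group membership before any response is measured, differences between the groups other than the treatment should be due to chance.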
Chapter 2: Describing data
One categorical variable
- A frequency table gives the counts in each category of a categorical variable. The proportion in
some category is found by: proportion in a category = number in that category / total number
-> proportions are also called relative frequencies, and can be displayed in a relative frequency table
- Visualizing the data in one categorical variable:
- Bar chart: the vertical axis gives the frequency (or count) and a bar of the appropriate
height is shown for each category
- Pie chart: in this chart, proportions correspond to areas of sectors of a circle
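As a small illustration of the frequency table and the proportion formula above, the following Python sketch counts categories and converts the counts to relative frequencies. The survey answers are made up for the example.

```python
from collections import Counter

# Made-up categorical data: one answer per case
responses = ["yes", "no", "yes", "unsure", "yes", "no", "yes", "no"]

counts = Counter(responses)        # frequency table: category -> count
total = sum(counts.values())       # total number of cases

# proportion in a category = number in that category / total number
proportions = {category: count / total for category, count in counts.items()}

print(f"{'category':>8} {'count':>5} {'proportion':>10}")
for category, count in counts.items():
    print(f"{category:>8} {count:>5} {proportions[category]:>10.3f}")
```

The counts are what a bar chart would plot, and the proportions are what a pie chart would show as sector areas.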
Two categorical variables
- A two-way table is used to show the relationship between two categorical variables.
- There are several different types of graphs to use to visualize a relationship between two
categorical variables: one is a segmented bar chart. The same information can be displayed in a
side-by-side bar chart, in which separate bar charts are given for each group in one of the variables
-> these two charts are called comparative plots, since they allow us to compare groups in a variable
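A two-way table can be sketched in Python by counting pairs of categories. The data below are invented, and the names `table`, `row_totals` and `within_group` are only illustrative. The within-group proportions are the quantities a segmented bar chart would compare across groups.

```python
from collections import Counter

# Made-up cases: (group, answer) pairs, one per case
cases = (
    [("group A", "yes")] * 3 + [("group A", "no")] * 1
    + [("group B", "yes")] * 2 + [("group B", "no")] * 4
)

table = Counter(cases)                        # (group, answer) -> count
groups = sorted({group for group, _ in cases})
answers = sorted({answer for _, answer in cases})

# Row totals and the proportion of each answer within each group
row_totals = {g: sum(table[(g, a)] for a in answers) for g in groups}
within_group = {
    g: {a: table[(g, a)] / row_totals[g] for a in answers} for g in groups
}

print(f"{'':>8}" + "".join(f"{a:>6}" for a in answers) + f"{'total':>7}")
for g in groups:
    cells = "".join(f"{table[(g, a)]:>6}" for a in answers)
    print(f"{g:>8}{cells}{row_totals[g]:>7}")
```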
One quantitative variable: shape and center
- Moderately sized data sets can be visualized by using a dot plot
- An alternative graph for displaying a distribution of data is a histogram (values are often grouped
into intervals)
- Common shapes for distributions. A distribution shown in a histogram or dot plot is called:
- Symmetric if the sides approximately match when folded on a vertical center line (m = x-bar)
- Skewed to the right if the data are piled up on the left and the tail extends relatively far out
to the right (mean > median or x-bar > m)
- Skewed to the left if the data are piled up on the right and the tail extends relatively far out
to the left (mean < median or x-bar < m)
- Bell-shaped if the data are symmetric and, in addition, 'uniform' in the sense of tapering off evenly
on both sides of a single central peak (not flat)
-> please see page 64 for clear examples
- Two summary statistics that describe the center or location of a distribution for a single quantitative
variable are:
- The mean, which is given by: mean = $\frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x}{n}$
-> for a sample, the mean is denoted $\bar{x}$ ('x-bar'); for a population, it is denoted by $\mu$
- The median, which is denoted m, is:
- the middle entry if an ordered list of data values contains an odd number of entries;
- the average of the middle two values if an ordered list contains an even number.
-> the median, thus, splits the data in half
- Resistance: in general, we say that a statistic is resistant if it is relatively unaffected by extreme
values. The median is resistant (it cuts a dot plot/histogram in two), while the mean (which is a
'balancing point' of the values) is not. The median and IQR are resistant; the standard deviation
and the mean are not
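The difference in resistance between the mean and the median can be checked with a small Python sketch. The data values are made up; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Made-up sample, then the same sample with one extreme value added
values = [2, 3, 3, 4, 5, 6, 7]
with_outlier = values + [100]

print("without extreme value:", round(mean(values), 2), median(values))
print("with extreme value:   ", mean(with_outlier), median(with_outlier))

# The mean moves from about 4.29 to 16.25, while the median only moves
# from 4 (middle entry of 7 values) to 4.5 (average of the middle two of 8).
```

A single extreme value pulls the mean far to the right while the median barely moves, which is what resistance means in practice.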
One quantitative variable: measures of spread
- The standard deviation for a quantitative variable measures the spread of the data in a sample:
Standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
-> the standard deviation roughly estimates the typical distance of a data value from the mean. The
larger the standard deviation, the more variable and spread out the data are
- If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should
fall within two standard deviations of the mean. This means that about 95% of the data in a sample
from a bell-shaped distribution should fall in the interval from x-bar minus 2s to x-bar plus 2s
- Number of standard deviations from the mean: z-scores
- The z-score for a data value, x, from a sample with mean x-bar and standard deviation s is
defined to be: $z = \frac{x - \bar{x}}{s}$
-> for a population, x-bar is replaced with $\mu$ and s with $\sigma$
- The z-score tells how many standard deviations the value is from the mean, and is independent of
the unit of measurement
-> if the data have a distribution that is symmetric and bell-shaped, we know from the 95% rule that
about 95% of the data will fall within two standard deviations of the mean. This means that only
about 5% of the data values will have z-scores beyond ±2
- The Pth percentile is the value of a quantitative variable which is greater than P percent of the data
- Five number summary = (minimum, Q1, median, Q3, maximum), where Q1 = 25th percentile,
median = 50th percentile, Q3 = 75th percentile
-> the five-number summary divides the data set into fourths: about 25% of the data fall between
any two consecutive numbers in the five-number summary
- The five-number summary provides two additional opportunities for summarizing the amount of
spread in the data:
- Range = Maximum - Minimum
- Interquartile range = Q3 - Q1
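The standard deviation and z-score formulas from this section can be sketched in Python as follows. The data are made up, and the helper name `z_score` is only illustrative. `statistics.stdev` uses the n − 1 denominator, matching the sample formula in these notes.

```python
from statistics import mean, stdev

# Made-up sample of a quantitative variable
data = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = mean(data)   # sample mean
s = stdev(data)      # sample standard deviation (divides by n - 1)


def z_score(x, x_bar, s):
    """Number of standard deviations that x lies from the mean."""
    return (x - x_bar) / s


# Interval x-bar - 2s to x-bar + 2s from the 95% rule
low, high = x_bar - 2 * s, x_bar + 2 * s
within_two = [x for x in data if abs(z_score(x, x_bar, s)) <= 2]

print(f"mean = {x_bar}, s = {s:.3f}")
print(f"interval: {low:.3f} to {high:.3f}")
print(f"{len(within_two)} of {len(data)} values have |z| <= 2")
```

The 95% rule is stated for bell-shaped distributions, so a sample this small only illustrates the arithmetic and is not a check of the rule itself.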
Outliers, boxplots and quantitative/categorical relationships
- Detection of outliers: As a general rule of thumb, we call a data value an outlier if it is:
Smaller than Q1 - 1.5(IQR) or Larger than Q3 + 1.5(IQR)
- To draw a boxplot:
- Draw a numerical scale appropriate for the data values
- Draw a box stretching from Q1 to Q3
- Divide the box with a line drawn at the median
- Draw a line from each quartile to the most extreme data value that is not an outlier
- Identify each outlier individually by plotting with a symbol such as an asterisk or a dot
-> you should be able to discuss what the boxplot tells us about the distribution of a variable
- Side-by-side graphs are used to visualize the relationship between quantitative and categorical
variables. The side-by-side graph includes a graph for the numerical variable (such as a boxplot,
histogram, or dot plot) for each group in the categorical variable, all using a common numeric axis
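The numbers needed to draw a boxplot (five-number summary, IQR, outlier cutoffs and whisker ends) can be sketched in Python as below. One assumption is flagged loudly: quartiles are computed here as the medians of the lower and upper halves of the ordered data, which is one common convention. Statistical software may use a slightly different quartile rule and return slightly different values. The function names and data are made up for the example.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle value when the number of values is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


def find_outliers(values):
    """Values smaller than Q1 - 1.5(IQR) or larger than Q3 + 1.5(IQR)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


# Made-up data with one unusually large value
data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 25]

summary = five_number_summary(data)
outliers = find_outliers(data)
non_outliers = [x for x in data if x not in outliers]
whiskers = (min(non_outliers), max(non_outliers))  # ends of the boxplot lines

print("five-number summary:", summary)
print("outliers:", outliers)
print("whiskers reach:", whiskers)
```

For this data the box would run from Q1 = 4 to Q3 = 8 with a line at the median 5.5, the whiskers would reach 1 and 9, and 25 would be plotted individually as an outlier.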
Two quantitative variables: scatterplot and correlation
- Visualizing a relationship between two quantitative variables: scatterplots. A scatterplot includes a
pair of axes with appropriate numerical scales, one for each variable. The paired data for each case is
plotted as a point on the scatterplot. If there are explanatory and response variables, we put the
explanatory variable on the horizontal axis and the response variable on the vertical axis.
-> see page 105 for an example
- Interpreting a scatterplot:
- Do the points form a clear trend with a particular direction, are they more scattered about a
general trend, or is there no obvious pattern?
- If there is a trend, is it generally upward or generally downward as we look from left to
right? A general upward trend is called a positive association while a general downward trend
is called a negative association.