Lectures advanced statistics
Recap lecture continuous outcomes
• Comparison of 2 groups: t-test
• Comparison of more than 2 groups: ANOVA
• Linear regression analysis
• Non-normally distributed outcome variables
Comparison of 2 groups: t-test
o When you have a continuous outcome variable and two groups
o Cross-sectional cohort study; outcome variable: cholesterol; 100 persons
o What is the difference in cholesterol concentration between males and females?

SPSS output, Group Statistics (cholesterol in mmol/l):
female: N = 47, mean = 4.545, SD = 0.840, SE = 0.122
male: N = 53, mean = 4.865, SD = 0.755, SE = 0.104

SPSS output, Independent Samples Test (t-test for Equality of Means):
t = -2.003, df = 98, Sig. (2-tailed) = 0.048, mean difference = -0.319, SE of the difference = 0.159, 95% CI of the difference: -0.636 to -0.003

o The first part of the output is descriptive information. The cholesterol concentration in the males (4.9) is higher than in the females (4.5).
o The effect estimate is basically the difference in cholesterol concentration. You can find that under mean difference.
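The lecture works with SPSS output; as a rough sketch, the same kind of t-test could be run in Python with scipy. The data frame below is made up for illustration (it is not the study data):

```python
# Minimal sketch of the two-group comparison, assuming a hypothetical data frame
# with a cholesterol column (mmol/l) and a sex column coded 0 = female, 1 = male.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "chol": [4.1, 4.6, 5.0, 4.3, 4.9, 5.2],   # made-up cholesterol values
    "sex":  [0,   0,   0,   1,   1,   1],
})

females = df.loc[df["sex"] == 0, "chol"]
males = df.loc[df["sex"] == 1, "chol"]

# Independent-samples t-test assuming equal variances.
t, p = stats.ttest_ind(females, males, equal_var=True)
mean_difference = females.mean() - males.mean()   # the effect estimate
print(t, p, mean_difference)
```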
Comparison of more than 2 groups:
o What is the relationship between cholesterol and alcohol consumption?
o Three groups
a. Non drinkers
b. Moderate drinkers (1-2 glasses per day)
c. Heavy drinkers (> 2 glasses per day)
o Comparing three mean values -> Analysis of variance (ANOVA)
o You will get mainly p-values. It is statistical testing. It is not much used.
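A minimal sketch of such an ANOVA in Python with scipy, using made-up cholesterol values for the three drinking groups (not the lecture data); it indeed returns little more than an F statistic and a p-value:

```python
# One-way ANOVA comparing three group means (hypothetical data).
from scipy import stats

non_drinkers = [4.9, 5.1, 4.7, 4.8]
moderate_drinkers = [4.2, 4.4, 4.3, 4.5]
heavy_drinkers = [5.0, 5.3, 5.2, 5.4]

f_stat, p_value = stats.f_oneway(non_drinkers, moderate_drinkers, heavy_drinkers)
print(f_stat, p_value)   # only an F statistic and a p-value, no effect estimate
```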
Linear regression analysis:
A linear regression comparing two groups with each other is the same as an independent t-test. Both the
difference between two groups and the difference between three groups can be analyzed with linear
regression analysis
What is the relationship between cholesterol and age? Age is a continuous determinant
When you have a continuous determinant and continuous outcome you make a scatterplot: we make a line
through the dots. The ‘Best’ line is estimated with the ‘least squares method’ -> Distance between the
observed points and the estimated line is as small as possible
o Y=b0 + b1 * X
o B0= constant, intercept
The interpretation of the regression coefficients is always the same:
cholesterol = 3.859 + 0.021 * age
- Suppose you fill in zero for age. The b0 is the value of the outcome when the independent variable equals zero, in this situation when age equals zero (this is often not informative).
- b1: when age differs by 1 unit, the outcome (cholesterol) differs by the regression coefficient: when age differs by 1 unit, cholesterol differs by 0.021 units. 0.021 is, for example, the difference in cholesterol concentration between persons aged 60 and 61.
- We assume a linear relationship between age and cholesterol: every step in age
has the same influence on cholesterol. This is an assumption we have to check
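As a sketch (not the lecture's SPSS analysis), such a linear regression could be fitted in Python with statsmodels; the data below are simulated around the equation above:

```python
# Linear regression of cholesterol on age, on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
age = rng.uniform(40, 70, size=100)
chol = 3.859 + 0.021 * age + rng.normal(0, 0.8, size=100)   # simulated around the lecture's equation
df = pd.DataFrame({"age": age, "chol": chol})

model = smf.ols("chol ~ age", data=df).fit()
print(model.params)      # Intercept = b0, age = b1 (difference in cholesterol per 1 year of age)
print(model.conf_int())  # 95% confidence intervals for b0 and b1
```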
The b0 is the value of the outcome when the independent variable equals zero, so the b0 is the average cholesterol for the females and the b1 is the difference between the males and females.
b0 = 4.545: average cholesterol concentration in the females
b1 = 0.319: difference in cholesterol between males and females

SPSS output, Coefficients (dependent variable: cholesterol in mmol/l):
(Constant): B = 4.545, SE = 0.116, t = 39.168, Sig. = 0.000
sex: B = 0.319, SE = 0.159, Beta = 0.198, t = 2.003, Sig. = 0.048

In linear regression we can consider confounding and effect modification. We cannot do this with an independent t-test.
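A minimal sketch of this equivalence in Python with statsmodels (hypothetical data frame, sex coded 0 = female, 1 = male): the coefficient for sex equals the t-test's mean difference, with the same p-value.

```python
# Linear regression with a dichotomous determinant reproduces the independent t-test.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"chol": [4.1, 4.6, 5.0, 4.3, 4.9, 5.2],   # made-up values
                   "sex": [0, 0, 0, 1, 1, 1]})               # 0 = female, 1 = male

model = smf.ols("chol ~ sex", data=df).fit()
print(model.params["Intercept"])                   # b0: average cholesterol in the females
print(model.params["sex"])                         # b1: mean difference males - females
print(model.tvalues["sex"], model.pvalues["sex"])  # same p-value as the equal-variance t-test (t equal up to sign)
```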
Back to the example with 3 groups. Mean (SD):
Non-drinkers: 4.86 (0.63), coded 0
Moderate drinkers (1-2 glasses per day): 4.29 (0.77), coded 1
Heavy drinkers (>2 glasses per day): 5.18 (0.83), coded 2
The moderate drinkers have the lowest concentration and the heavy drinkers the highest concentration. The non-drinkers are in between.
If we put this variable (coded 0, 1, 2) in our model as a continuous determinant, we get this output:

SPSS output, Coefficients (dependent variable: cholesterol in mmol/l):
(Constant): B = 4.654, SE = 0.119, t = 39.253, Sig. = 0.000
alcohol consumption: B = 0.074, SE = 0.106, Beta = 0.070, t = 0.698, Sig. = 0.487

This is not what we expect: the constant should be the estimate for the cholesterol concentration when alcohol consumption equals zero, so the non-drinkers. But their average (4.86) is not equal to 4.654.
When alcohol consumption differs by 1 unit, the outcome variable differs by 0.074 units. The estimated difference in cholesterol concentration between the moderate drinkers and the non-drinkers is therefore 0.074: we estimate that the moderate drinkers have a higher cholesterol than the non-drinkers.
That does not make sense because the moderate drinkers have a lower cholesterol concentration than the
non-drinkers. What goes wrong: we put alcohol consumption as a continuous variable in the model, assuming a
linear relation. But based on the descriptive information, there is no linear relation. First it goes down and then
it goes up again. You should create dummy variables.
Divide the continuous determinant into dummies: is there a linear relationship with the outcome?
Yes -> put the determinant continuously in the model
No -> put the determinant as dummies in the model
Alcohol consumption is not a continuous variable but a categorical variable and has to be analysed with dummy variables.
The number of categories of the variable minus one is the number of dummies. So here, we should create two dummies.

SPSS output, Categorical Variables Codings (alcohol consumption):
non-drinker: frequency = 40, dummy (1) = 0, dummy (2) = 0
1-2 glasses per day: frequency = 38, dummy (1) = 1, dummy (2) = 0
> 2 glasses per day: frequency = 22, dummy (1) = 0, dummy (2) = 1
SPSS output, Coefficients (dependent variable: cholesterol in mmol/l):
(Constant): B = 4.863, SE = 0.116, t = 42.017, Sig. = 0.000
dummy al1: B = -0.575, SE = 0.166, Beta = -0.347, t = -3.468, Sig. = 0.001
dummy al2: B = 0.318, SE = 0.194, Beta = 0.164, t = 1.639, Sig. = 0.105

The b0 applies to the reference category: the non-drinkers. This means that 4.863 is the estimate of the average cholesterol concentration for the non-drinkers.
-0.575 is the difference in cholesterol concentration between the moderate drinkers and the non-drinkers. So, the moderate drinkers have a lower concentration.
0,318 is the difference in cholesterol concentration between the heavy drinkers and the
non-drinkers. So, the heavy drinkers have a higher concentration.
You always compare a group with the reference category. There is no estimate for the difference between heavy and moderate drinking. You can fill in the regression equation and calculate the difference, but you cannot calculate the standard error. Without the SE you also cannot obtain the t statistic, 95% CI and p-value. You should recode your dummies (choose another reference category).
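A sketch of the dummy-variable analysis in Python with statsmodels (hypothetical data, alcohol coded 0 = non, 1 = moderate, 2 = heavy): C() creates the k-1 dummies, and changing the reference category gives the heavy-versus-moderate comparison with its own SE.

```python
# Dummy variables for a 3-category determinant instead of treating 0/1/2 as continuous.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"chol": [4.9, 4.8, 4.3, 4.2, 5.1, 5.3],   # made-up values
                   "alcohol": [0, 0, 1, 1, 2, 2]})

# C() creates two dummies; category 0 (non-drinkers) is the reference.
model = smf.ols("chol ~ C(alcohol)", data=df).fit()
print(model.params)   # C(alcohol)[T.1]: moderate vs non; C(alcohol)[T.2]: heavy vs non

# Recode the reference category to compare heavy with moderate drinkers directly.
model2 = smf.ols("chol ~ C(alcohol, Treatment(reference=1))", data=df).fit()
print(model2.params)
```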
Next example:
What is the relationship between cholesterol and BMI?
BMI is a continuous determinant.

SPSS output, Coefficients (dependent variable: total cholesterol):
(Constant): B = 3.865, SE = 0.519, t = 7.448, Sig. = 0.000
body mass index: B = 0.086, SE = 0.018, Beta = 0.316, t = 4.685, Sig. = 0.000

The regression coefficient refers to a difference of 1 unit in BMI!
B0: the estimated/average cholesterol concentration when BMI is zero (non-informative number, there is never
a subject with a BMI of 0)
B1: when there is a difference of one unit in BMI, there is a difference of 0,086 unit in cholesterol. Every step in
BMI has the same effect on cholesterol.
Often you do not report the difference for 1 unit of BMI but for e.g. 5 units of BMI:
Difference for 5 units: 5 x regression coefficient = 5 x 0.086 = 0.43
95% CI for 5 units: 5 x b1 ± 1.96 x 5 x SE (fill this in on the calculator in one go)
(In logistic and Cox regression, discussed later, you additionally take the e power of these numbers to obtain an OR or HR for 5 units.)
Non-normally distributed outcome variables:
When you talk about average values, it only has a good interpretation if it is normally distributed. A skewed to
the right outcome variable can be log transformed.
Difference between the groups in ln(triglyceride) = 0.127. It is the difference between males and females in the outcome variable, and the outcome variable is the natural log of triglycerides: the natural log of triglycerides differs 0.127 between males and females. It is not informative to report the difference in natural log values. We have to back-transform it: you can take the e power.

SPSS output, Coefficients (dependent variable: ln_trig):
(Constant): B = -0.413, SE = 0.046, t = -8.913, Sig. = 0.000
sex: B = 0.127, SE = 0.066, Beta = 0.112, t = 1.941, Sig. = 0.053
EXP[0.127] = 1.14: males have a 1.14 times higher triglyceride concentration than the females.
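A sketch of this log-transform-and-back-transform step in Python with statsmodels, on simulated skewed triglyceride values (not the lecture data):

```python
# Regression on the natural log of a right-skewed outcome, then back-transformation.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
sex = rng.integers(0, 2, size=100)                                 # 0 = female, 1 = male
trig = np.exp(-0.4 + 0.13 * sex + rng.normal(0, 0.5, size=100))    # simulated skewed triglycerides
df = pd.DataFrame({"sex": sex, "ln_trig": np.log(trig)})

model = smf.ols("ln_trig ~ sex", data=df).fit()
print(np.exp(model.params["sex"]))   # e.g. exp(0.127) = 1.14: ratio males vs females
```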
Recap lecture dichotomous outcomes
• Comparison of 2 groups: chi-square test (/ fisher exact test)
• Comparison of more than 2 groups: chi-square test (/ fisher exact test)
• Logistic regression analysis
• Confounding and effect modification
Comparison of 2 groups:
New medicine for pain relief, 100 patients, Follow-up time: 1 year
Recovery, defined as a dichotomous variable (recovered versus not recovered)
Three measures of effect:
RD = 35% - 20% = 15%
RR = 35% / 20% = 1.75
OR = (35/65) / (20/80) = 2.15
Statistical test: chi-square test

Comparison of more than 2 groups:
Alternative medicine is added to the experimental study. Three groups to compare.

SPSS output, Chi-Square Tests:
Pearson Chi-Square: value = 10.013, df = 2, Asymp. Sig. (2-sided) = 0.007
Linear-by-Linear Association: value = 5.182, df = 1, Asymp. Sig. (2-sided) = 0.023
N of valid cases = 300
a. 0 cells (0.0%) have an expected count less than 5. The minimum expected count is 31.67.
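A sketch of the chi-square test in Python with scipy, on a made-up recovered / not-recovered table for the three groups (the counts are illustrative, not the study data):

```python
# Chi-square test on a 3 x 2 table of recovery by treatment group.
from scipy.stats import chi2_contingency

#         recovered, not recovered
table = [[20, 80],    # placebo
         [35, 65],    # new medicine
         [30, 70]]    # alternative medicine (hypothetical counts)

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)   # dof = (3 - 1) * (2 - 1) = 2, as in the output above
```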
Logistic regression analysis:
Both the differences between two groups and the differences between three groups can also be analysed with
logistic regression analysis.
We want to create something more or less the same as linear regression, but with a dichotomous outcome. To make that transition you have to consider the assumption of the linear regression model that the outcome is continuous and normally distributed. What we have to do is transform the dichotomous outcome into something that is continuous and normally distributed.
1. A dichotomous outcome is not normally distributed.
2. So, we look at the probability of the outcome, p(Y=1). But this is still not really continuous, as it goes from zero to one, and it is not normally distributed but binomial.
3. Therefore, we take the relative probability of the outcome: the odds. The odds are more or less continuous, because the lowest value equals zero and the highest infinity. But the odds are skewed to the right.
4. That is why we take the natural log of the odds of the outcome.
Now we are interested in a dichotomous outcome
Y = b0 + b1X
In fact, we are interested in the probability of Y=1 as outcome
p(Y=1) = b0 + b1X
The probability can only lie between 0 and 1 and has a binomial distribution. Solution: take the odds (Y=1) as
outcome (the probability of the outcome divided by 1 minus the probability of the outcome)
p(Y=1) / (1 - p(Y=1)) = b0 + b1X
This is better, but the odds is skewed to the right. Solution: take the natural log of the odds
ln[ p(Y=1) / (1 - p(Y=1)) ] = b0 + b1X
Now we have a continuous variable which is also normally distributed
Logistic model; transform to probability
p(Y=1) = 1 / (1 + e^-(b0 + b1X))
You can transform the logistic model into the probability of the outcome. (For the minus above the e: enter it on the calculator as the sign of a negative number, not as a subtraction.)
Example with dichotomous determinant:
Case-control study investigating several determinants for
myocardial infarction (MI)
Question: Is smoking a determinant for having MI?
The odds of having a myocardial infarction for smokers is 2,225 times as high as the odds of having a
myocardial infarction for non-smokers
Natural log of the odds:
0.80 indicates the difference in the outcome between smokers and non-smokers; the outcome is the natural log of the odds of having a myocardial infarction, so 0.80 is the natural log of the odds ratio.
Wald test = (b / SE(b))²
Chi-square distributed with 1 degree of freedom: (0.800 / 0.245)² = 10.623
Critical value of chi-square with 1 df = 3.84, so the effect of smoking is statistically significant.
smokers: ln(odds) = b0 + b1: the natural log of the odds for a myocardial infarction
non-smokers: ln(odds) = b0: the natural log of the odds for a myocardial infarction
How do we obtain an odds ratio and the 95% CI from the output of a logistic regression analysis? You should
take the e power:
OR = EXP (0.800) = 2.23
95% confidence interval: b ± 1.96 x SE(b), then take the e power of both boundaries. 95% CI: 1.38 - 3.60
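A minimal sketch of this back-transformation, using the coefficient and SE for smoking from the notes:

```python
# From regression coefficient (and SE) to odds ratio with 95% CI.
import numpy as np

b, se = 0.800, 0.245
print(np.exp(b))                                     # OR ~ 2.23
print(np.exp(b - 1.96 * se), np.exp(b + 1.96 * se))  # 95% CI ~ 1.38 to 3.60
```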
95% CI and statistical significance:
Linear: if it includes 0, it is not significant
Logistic and Cox: if it includes 1, it is not significant
Probability of MI for a smoker and a non-smoker?
For non-smokers:
P(Y=1) = 1 / (1 + e^-(-0.171)) = 0.46
For smokers:
P(Y=1) = 1 / (1 + e^-(-0.171 + 0.800)) = 0.65
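A small sketch of these probability calculations, using the coefficients from the notes (b0 = -0.171, b1 = 0.800 for smoking):

```python
# Predicted probability of MI from the logistic model.
import numpy as np

def prob(b0, b1, x):
    # p(Y=1) = 1 / (1 + e^-(b0 + b1*x))
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

print(prob(-0.171, 0.800, 0))   # non-smokers: ~ 0.46
print(prob(-0.171, 0.800, 1))   # smokers:     ~ 0.65
```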
Likelihood-ratio test:
Likelihood = Product of all probabilities for each person given the values of the determinants
Can be calculated based on the 2x2 table and the estimated probabilities (given the value of the determinant). Divide each cell count by its row total:
1: smokers with MI: 60 / 92 = 0.65
2: smokers without MI: 32 / 92 = 0.35
3: non-smokers with MI: 150 / 328 = 0.46
4: non-smokers without MI: 178 / 328 = 0.54
Likelihood:
(0.65)^60 x (0.35)^32 x (0.46)^150 x (0.54)^178
The likelihood is very small, therefore we work with the -2 log likelihood.
Model with smoking:
-2 ln [(0.65)^60 x (0.35)^32 x (0.46)^150 x (0.54)^178] = 571.20
Model without determinants: (50% cases, so probability of MI = 0.5):
-2 ln [(0.5)^420] = 582.24
In the model without determinants, always use the ratio of cases to the total to determine the probability of the outcome: people with the outcome (cases) / total number of subjects.
On the calculator: -2 x ln((….))
It gives you an indication how important smoking is in relation to the prediction of myocardial infarction.
- Comparing two models with each other
- Difference in -2 log likelihoods
- Difference follows a chi-square distribution
- The number of degrees of freedom is equal to the difference in number of estimated parameters of the
two models that are being compared.
- Essential: one model must be an extension of the other model
- The absolute value of the -2 log likelihood is meaningless; it is only important in relation to the likelihood-
ratio-test
Example: 582.24 - 571.20 = 11.04. Chi-square distributed with 1 degree of freedom.
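A sketch of this likelihood-ratio calculation in Python, using the cell counts and estimated probabilities from the notes:

```python
# -2 log likelihoods of the model with smoking and the model without determinants,
# and the likelihood-ratio test between them.
import numpy as np
from scipy.stats import chi2

ll_smoking = 60 * np.log(0.65) + 32 * np.log(0.35) + 150 * np.log(0.46) + 178 * np.log(0.54)
ll_null = 420 * np.log(0.5)          # no determinants: probability 0.5 for everyone

m2ll_smoking = -2 * ll_smoking       # ~ 571.2
m2ll_null = -2 * ll_null             # ~ 582.2

lr = m2ll_null - m2ll_smoking        # ~ 11.0, chi-square distributed with 1 df
print(lr, chi2.sf(lr, df=1))         # p-value of the likelihood-ratio test
```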
Logistic regression analysis with a categorical determinant is the same as for linear regression analysis
Logistic regression analysis with a continuous determinant is the same as for linear regression analysis
Example:
SPSS output, Variables in the Equation (step 1: BMI):
BMI: B = 0.177, SE = 0.038, Wald = 21.491, df = 1, Sig. = 0.000, Exp(B) = 1.194
Constant: B = -4.235, SE = 0.918, Wald = 21.264, df = 1, Sig. = 0.000, Exp(B) = 0.014

OR for BMI = EXP[0.177] = 1.19, with 95% CI = EXP[0.177 ± 1.96 x 0.038] = [1.11 - 1.29]
This is the OR for a difference of 1 unit in BMI.
More interesting is the OR for a difference in e.g. 5 units:
OR for 5 units BMI
Multiply the regression coefficient by 5 and then take the e power
95%CI for 5 units BMI
= EXP [5 x .177 ± 5 x 1.96 x .038]
= 2.43 [1.67 - 3.53]
Check for linearity: every step in BMI has the same increase in outcome variable.
- As BMI is a continuous determinant, we need to check whether the relation is linear
2 ways:
- Mathematical function/ dividing the continuous determinant into groups
Mathematical function:
SPSS output, Variables in the Equation (step 1: BMI, BMI2):
BMI: B = 1.689, SE = 0.501, Wald = 11.374, df = 1, Sig. = 0.001, Exp(B) = 5.414
BMI2: B = -0.031, SE = 0.010, Wald = 9.301, df = 1, Sig. = 0.002, Exp(B) = 0.970
Constant: B = -22.682, SE = 6.206, Wald = 13.356, df = 1, Sig. = 0.000, Exp(B) = 0.000
We add a squared term for BMI: we model a quadratic relationship between BMI and having a myocardial infarction. BMI2 is just BMI multiplied by BMI. How can we evaluate whether this model is better than a linear relationship? If this model is not better, we can just stick with the linear relationship. We can do that by looking at the p-value of the squared term. If the p-value is significant, the null hypothesis (that the regression coefficient for BMI squared equals zero) has to be rejected.
When it is significant, the regression coefficient for the squared term is not equal to zero, so BMI squared is a statistically important new variable. That indicates that a quadratic relationship describes the association between BMI and myocardial infarction better than a linear relationship. If it is not statistically significant, the null hypothesis is not rejected: statistically, the regression coefficient for BMI2 is not different from 0, the squared term is not important, the quadratic relationship is not needed and we go back to the linear relationship.
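A sketch of this check in Python with statsmodels, on simulated data (hypothetical, with a roughly linear true relation), where the squared term is added as an extra column:

```python
# Checking linearity by adding a squared BMI term to the logistic model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
bmi = rng.uniform(20, 35, size=300)
mi = rng.binomial(1, 1 / (1 + np.exp(-(-4.2 + 0.18 * bmi))))   # simulated outcome
df = pd.DataFrame({"bmi": bmi, "mi": mi})

df["bmi2"] = df["bmi"] ** 2                    # BMI2 is just BMI multiplied by BMI
model = smf.logit("mi ~ bmi + bmi2", data=df).fit(disp=0)
print(model.pvalues["bmi2"])   # significant -> quadratic fits better; otherwise keep the linear model
```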
Dividing the continuous determinant into groups:
A drawback of the mathematical function is that the interpretation of the regression coefficients is different. So, a better option is to make quartiles, 4 groups for BMI. We add 3 dummy variables to the model and see whether there is a linear trend in the regression coefficients.

SPSS output, Variables in the Equation (step 1: NBMI):
NBMI: Wald = 33.663, df = 3, Sig. = 0.000
NBMI(1): B = 0.908, SE = 0.296, Wald = 9.422, df = 1, Sig. = 0.002, Exp(B) = 2.480
NBMI(2): B = 1.664, SE = 0.303, Wald = 30.062, df = 1, Sig. = 0.000, Exp(B) = 5.281
NBMI(3): B = 1.335, SE = 0.298, Wald = 20.092, df = 1, Sig. = 0.000, Exp(B) = 3.800
Constant: B = -0.985, SE = 0.221, Wald = 19.793, df = 1, Sig. = 0.000, Exp(B) = 0.373

The lowest/first quartile is the reference category. There is a 0.908 difference between subjects in the second quartile and the first. We have a regression coefficient of 1.664 for subjects in the third quartile, compared to the first quartile, and a regression coefficient of 1.335 for subjects in the fourth quartile, compared to the first quartile.
If you look at the regression coefficients: the first two show more or less a linear trend, but for the third dummy there is a decrease. That indicates that there is no linear relationship. You have to keep the dummies here and you should not analyse the determinant continuously.
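A sketch of the quartile approach in Python with statsmodels (same kind of simulated data as above): pd.qcut makes the four groups and C() adds the three dummies, after which you inspect the trend in the coefficients.

```python
# Checking linearity by dividing BMI into quartiles and inspecting the dummy coefficients.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
bmi = rng.uniform(20, 35, size=300)
mi = rng.binomial(1, 1 / (1 + np.exp(-(-4.2 + 0.18 * bmi))))   # hypothetical data
df = pd.DataFrame({"bmi": bmi, "mi": mi})

df["bmi_q"] = pd.qcut(df["bmi"], q=4, labels=False)   # quartiles 0..3, lowest = reference
model = smf.logit("mi ~ C(bmi_q)", data=df).fit(disp=0)
print(model.params)   # roughly equal steps between the dummy coefficients -> linear; otherwise keep the dummies
```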
Confounding and effect modification:
The question is whether sex is an effect modifier or a confounder in the
relationship between BMI and MI
To investigate effect modification, you introduce an interaction term (possible effect modifier x central determinant). How can we evaluate whether or not there is an interaction between BMI and sex? You look at the p-value of the interaction term. When it is significant, there is effect modification.
When you investigate effect modification by a categorical variable with > 2 categories, you make an interaction term for each dummy. To see whether there is effect modification, you look at the p-value of the overall Wald test of the interaction terms.
For confounding, we compare a model with the confounder to a model without the confounder: there should be a difference of more than 10% in the regression coefficient of the central determinant, calculated as (new - old) / old. We look at the three coefficients: the first dummy changes by more than 10%, so we directly see that there is confounding here. Either we keep all the dummies or we throw them all out: if there is a 10% difference in one of the dummies, there is confounding.
In the model you have the central determinant, the EM, and an
interaction (for each dummy)= EM * determinant
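A sketch of both checks in Python with statsmodels, on simulated data with hypothetical column names mi, bmi and sex:

```python
# Effect modification (interaction term) and confounding (10% change in coefficient).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
bmi = rng.uniform(20, 35, size=400)
sex = rng.integers(0, 2, size=400)
mi = rng.binomial(1, 1 / (1 + np.exp(-(-4.5 + 0.15 * bmi + 0.4 * sex))))
df = pd.DataFrame({"bmi": bmi, "sex": sex, "mi": mi})

# Effect modification: add the possible effect modifier and the interaction term,
# then look at the p-value of the interaction.
em_model = smf.logit("mi ~ bmi + sex + bmi:sex", data=df).fit(disp=0)
print(em_model.pvalues["bmi:sex"])

# Confounding: compare the BMI coefficient with and without sex in the model;
# a change of more than 10%, (new - old) / old, indicates relevant confounding.
crude = smf.logit("mi ~ bmi", data=df).fit(disp=0)
adjusted = smf.logit("mi ~ bmi + sex", data=df).fit(disp=0)
change = (adjusted.params["bmi"] - crude.params["bmi"]) / crude.params["bmi"]
print(abs(change) > 0.10)
```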
Recap lecture survival data
• Introduction
• Comparison of 2 groups
• Comparison of more than 2 groups
• Cox regression analysis
• Proportional hazards
Introduction:
Dichotomous outcome variable + a time indicator can be described using a ‘survival’ curve
S(t) = probability of surviving up to and including period t
S(t) = S(t-1) * survival fraction of period t (the probability of surviving period t, given that the subject is alive at the start of that period)
Dichotomous outcome and also the time when the dichotomous outcome occurs.
The event and the time to event.
Comparison of 2 groups:
Back to the example with ‘recovery’ as outcome
We have survival data, which can be shown in a survival curve. A survival curve is also known as a Kaplan-Meier curve.
The pink one is better than the green one.
If you have a positive outcome, recovery for example, a low curve is good.
If you have a negative outcome, mortality for example, a low curve is bad.
In the beginning the 2 groups are more or less the same. After month 5 the new
medication group becomes better than the placebo group. We can perform a statistical test to compare the 2
curves: log rank test. Chi square distributed. Nul hypothesis would be that the curves are the same. When the
log rank test is significant, the curves are significantly different. It is not possible to obtain an effect estimate
from the curve or log rank test. We cannot say anything how much better the medication group is compared to
the placebo group. For this, we can do a cox regression analysis.
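A sketch of a Kaplan-Meier curve and log-rank test in Python with the lifelines package, on made-up follow-up times (months) and recovery indicators for the two groups:

```python
# Kaplan-Meier curves and log-rank test for two groups (hypothetical data).
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

time_new = [2, 3, 5, 6, 7, 8, 9, 12]        # months until recovery or censoring
event_new = [1, 1, 1, 1, 1, 0, 1, 1]        # 1 = recovered, 0 = censored
time_placebo = [4, 6, 8, 9, 10, 11, 12, 12]
event_placebo = [1, 0, 1, 1, 0, 1, 0, 0]

km = KaplanMeierFitter()
km.fit(time_new, event_observed=event_new, label="new medication")
print(km.survival_function_)   # the curve: probability of not yet having had the event

result = logrank_test(time_new, time_placebo,
                      event_observed_A=event_new, event_observed_B=event_placebo)
print(result.test_statistic, result.p_value)   # chi-square statistic and p-value, no effect estimate
```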
Comparison of more than 2 groups:
The brown one goes down immediately. There is a fast decline in the probability
of not recovering.
Cox regression analysis:
- Both the differences between two groups and the differences between three
groups can be analysed with Cox-regression analysis
- Cox-regression analysis is comparable to linear and logistic regression
analysis, only the outcome variable looks different
- We do not analyse the survival, we analyse the hazard: ln[hazard] = ln[baseline hazard] + b1 * group
There is no b0; the baseline hazard is a function over time. Therefore, it is not in the output.
The coefficient for group is the difference in the outcome between the 2 groups. The outcome is the natural log of the hazard function, so the regression coefficient is basically the difference in the natural log of the hazard function.
Due to the characteristics of the (natural) logarithm, the regression coefficient (b) can be transformed into a
hazard ratio by taking the e power. HR = EXP(b)
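A sketch of a Cox regression in Python with the lifelines package (hypothetical data frame with time, event and group columns); exp(coefficient) gives the hazard ratio:

```python
# Cox regression: the group coefficient is the difference in ln(hazard); exp(b) = HR.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "time": [2, 3, 5, 6, 7, 8, 9, 12, 4, 6, 8, 9, 10, 11, 12, 12],
    "event": [1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0],
    "group": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],   # 1 = new medication, 0 = placebo
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(cph.params_)          # b for group: difference in ln(hazard)
print(cph.hazard_ratios_)   # HR = exp(b)
```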
The hazard ratio is comparable to a relative risk, but it is not the same
The hazard ratio is a combination of the ratio between the number of recovered patients and the ratio of the
time to recovery between the groups.
Cox regression analysis can also be used in a situation in which three groups are
compared. With dummy variables, we can compare the two medication groups
with the placebo group.
placebo is the reference category. Dummy 1 is the medication and dummy 2 is
the alternative medication group. For both regression coefficients we can
obtain HR by taking the e power. And we can also calculate the 95%CI
Of course, the same applies for the relationship between a dichotomous
variable + time and a continuous variable. Effect measure is the hazard ratio for
a difference in one unit of the continuous determinant
Example:
Cohort study to investigate the relationship between BMI and mortality.

SPSS output, Variables in the Equation:
BMI: B = 0.023, SE = 0.009, Wald = 5.720, df = 1, Sig. = 0.017, Exp(B) = 1.023, 95% CI for Exp(B): 1.004 - 1.042

The natural log of the hazard function and the HR are for a one-unit difference in BMI.
Converting the HR:
For each unit of BMI, the HR is 1.023
For 5 units of BMI: HR = EXP[5 x 0.023] = 1.12, 95% CI = EXP[5 x 0.023 ± 1.96 x 5 x 0.009] = [1.02 - 1.23]
Equivalently: (1.023)^5 = 1.12 [1.02 - 1.23]
We have a continuous determinant, so we should check linearity. We only have to look at the trend in the regression coefficients: from the first to the second quartile, the increase is minimal; the increase starts in the third and fourth quartile. There is not really linearity.

SPSS output, Variables in the Equation (BMI in quartiles):
NBMI: Wald = 6.608, df = 3, Sig. = 0.085
NBMI(1): B = 0.026, SE = 0.341, Wald = 0.006, df = 1, Sig. = 0.939, Exp(B) = 1.027, 95% CI 0.526 - 2.003
NBMI(2): B = 0.502, SE = 0.338, Wald = 2.199, df = 1, Sig. = 0.138, Exp(B) = 1.651, 95% CI 0.851 - 3.204
NBMI(3): B = 0.743, SE = 0.348, Wald = 4.562, df = 1, Sig. = 0.033, Exp(B) = 2.101, 95% CI 1.063 - 4.154
The assumption of linearity (when you have a continuous independent variable) applies to linear, logistic and Cox regression.
Proportional hazards:
If you report one HR, the HR is an average over time. We assume that this ratio reflects the ratio of the two curves over the whole period. In some situations this will not be the case; if you then report 1 HR, the HR is an overestimation for the first part of the period and an underestimation for the second part of the study.
We can test if it makes sense to report 1 HR: Investigate whether the
proportional hazards assumption holds by adding an interaction between
therapy (independent variable) and time to the model.
Interaction = central determinant x time.
The curve is about mortality. If the curve goes down, the probability of survival is very low. After 40 weeks the pink curve is much better than the green curve. Having 1 HR in this situation is not good: in the beginning the effect of the chemo is not that strong, but in the last part it is very strong.
We add an interaction term which is made out of the therapy variable and a time variable which is
dichotomized in two parts. We add the interaction to the model and we can see whether the interaction is
statistically significant. When the interaction is significant, the proportional hazard assumption is not met.
When we just add the chemo variable to our model: on average over time, the probability of dying in the chemotherapy group is 0.492 times as high as the probability of dying in the no-chemotherapy group.
Compared to effect modification, here you only add the interaction,
while in effect modification you add the possible effect modifier and
the interaction.
We check whether we can report this on-average-over-time HR. We split the time period in two parts based on the curve. The interaction term is T_COV multiplied by the central determinant. You look at whether it is significant. If it is significant, that means that the regression coefficient for chemotherapy is significantly different before and after 40 weeks. The assumption of proportional hazards does not hold, so we have to report 2 HRs: one for before 40 weeks and one for after 40 weeks.
The regression coefficient of -2.139 is the difference in outcome between the two groups, but only when the interaction term equals zero. The HR of 0.118 indicates that, after 40 weeks, the probability of dying in the chemotherapy group is 0.118 times as high as the probability of dying in the non-chemotherapy group. The HR for before 40 weeks has to be calculated by summing the 2 coefficients: -2.139 + 1.695 gives the regression coefficient, and we take the e power of it. If you want to calculate confidence intervals and p-values for before 40 weeks, you have to recode your dichotomous time variable.
So, we have 2 HRs:
One for the period before 40 weeks
One for the period after 40 weeks
HR (after 40 weeks) = EXP[-2.139] = 0.118
HR (before 40 weeks) = EXP[-2.139 + 1.695] = 0.641
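A small sketch of this back-transformation, using the two coefficients from the notes (chemotherapy -2.139, interaction with the time indicator 1.695):

```python
# From the time-interaction model to the two period-specific hazard ratios.
import numpy as np

b_chemo, b_interaction = -2.139, 1.695
print(np.exp(b_chemo))                   # HR after 40 weeks  ~ 0.118
print(np.exp(b_chemo + b_interaction))   # HR before 40 weeks ~ 0.641
```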
Confounding and effect modification can be investigated in the same way as in linear and logistic regression
analysis
Example:
Cohort study investigating the relationship between BMI and mortality.
Question: Is gender a confounder in the relation between BMI and mortality?

SPSS output, Variables in the Equation (crude model, BMI in quartiles):
NBMI: Wald = 6.608, df = 3, Sig. = 0.085
NBMI(1): B = 0.026, SE = 0.341, Wald = 0.006, df = 1, Sig. = 0.939, Exp(B) = 1.027, 95% CI 0.526 - 2.003
NBMI(2): B = 0.502, SE = 0.338, Wald = 2.199, df = 1, Sig. = 0.138, Exp(B) = 1.651, 95% CI 0.851 - 3.204
NBMI(3): B = 0.743, SE = 0.348, Wald = 4.562, df = 1, Sig. = 0.033, Exp(B) = 2.101, 95% CI 1.063 - 4.154
We just compare the regression coefficients of the output with the
regression coefficients adjusted for the confounder. If there is an
influence of 10% in one of the dummies, there is relevant confounding
by sex.
Question: Is gender an effect modifier in the relation between BMI and
mortality?
We have 3 dummy variables so we create 3 interaction terms. You look at
the p value of the overall wald test of the interaction term. If it is significant
you have to report separate results for males and females.
Summary: regression analyses:
Continuous outcome: linear regression
• Ŷ = b0 + b1 * X
• b1 gives the difference in means