RM | Unit 190 - Linear model with one dummy variable: testing two group means
Book: Analyzing Data Using Linear Models
Chapter 6: 6.1, 6.2, 6.3, 6.4
Chapter 6.1: Dummy coding
→ numeric variables say something about how much of an attribute is in an object: for instance height
(measured in inches) or heat (measured in degrees Celsius).
→ categorical variables say something about the quality of an attribute: for instance colour (red, green,
yellow) or type of seating (aisle seat, window seat).
→ ordinal variables are somewhat in the middle between numeric and categorical variables: they are
about quantitative differences between objects (e.g. size), but the values are sharp, disjoint categories (small,
medium, large), and the values are not expressed using units of measurement.
Dummy coding involves making one or more new variables that reflect the categorisation given by
a categorical variable.
First, we focus on categorical variables with only two categories (dichotomous variables).
An example of dummy coding:
Imagine we study bus companies and there are two different types of seating
in buses: aisle seats and window seats. Suppose we ask 5 people, who have travelled
from Amsterdam to Paris by bus during the last 12 months, whether they had an aisle
seat or a window seat during their last trip, and how much they paid for the trip.
Suppose we have the variables person, seat and price.
With dummy coding, we make a new variable that has only the values 0 and 1 and that conveys the
same information as the seat variable. The resulting variable is called a
dummy variable. Let’s call this dummy variable window and give it the
value 1 for all persons that travelled in a window seat. We give it the value 0 for
all persons that travelled in an aisle seat. We can also call the new variable
window a boolean variable with values TRUE and FALSE, since in computer
science TRUE is coded as 1 and FALSE as 0. Another name that is
sometimes used is indicator variable. Whatever you want to call it, the data matrix including the new
variable is displayed in Table 6.2.
What we have done now is recode the old categorical variable
seat into a variable window with values 0 and 1 that looks numeric.
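The recoding step can be sketched in base R. This is a minimal illustration only: the five rows below are invented for the example and are not the values from the book's tables, and the column name window_from_logical is a hypothetical helper name.

```r
# Hypothetical data: these five rows are invented for illustration
# and are NOT taken from the book's tables.
bus <- data.frame(
  person = 1:5,
  seat   = c("aisle", "aisle", "window", "window", "window"),
  price  = c(57, 61, 60, 64, 68)
)

# Dummy coding: window is 1 for a window seat and 0 for an aisle seat.
bus$window <- ifelse(bus$seat == "window", 1, 0)

# The comparison itself is a boolean (TRUE/FALSE);
# as.numeric() maps TRUE to 1 and FALSE to 0, giving the same dummy.
bus$window_from_logical <- as.numeric(bus$seat == "window")

print(bus)
```

Both new columns carry exactly the same information as seat, which is the point of dummy coding.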
Let’s see what happens if we use a linear model for the variables price
(dependent variable) and window (independent variable). The linear model is:
price = b0 + b1 × window + e
where b0 is the intercept, b1 is the slope and e is the residual.
Let’s use the bus trip data and determine the least-squares regression
line. We find the following linear equation:
predicted price = 59 + 5 × window
If the variable window has the value 1, then the expected or predicted price of the bus ticket is,
according to this equation, 59 + 5 × 1 = 64. What does this mean? Well, all persons who had a window
seat also had a value of 1 for the window variable. Therefore the expected price of a window seat equals
64. By the same token, the expected price of an aisle seat (window = 0) is 59 + 5 × 0 = 59, since all those
with an aisle seat scored 0 on the window variable.
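A small base-R sketch shows how such an equation is obtained and used for prediction. The data are hypothetical: they were invented so that the fitted line has intercept 59 and slope 5, and they are not the book's bus trip data.

```r
# Hypothetical data (invented so that the fit gives intercept 59 and slope 5);
# not the book's bus trip data.
bus <- data.frame(
  window = c(0, 0, 1, 1, 1),
  price  = c(57, 61, 60, 64, 68)
)

# Least-squares regression of price on the dummy variable window.
fit <- lm(price ~ window, data = bus)
print(coef(fit))

# Predicted price for an aisle seat (window = 0) and a window seat (window = 1).
predicted <- predict(fit, newdata = data.frame(window = c(0, 1)))
print(predicted)
```

With these invented values the predictions are 59 for an aisle seat and 64 for a window seat, matching the arithmetic 59 + 5 × 0 and 59 + 5 × 1.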
You see that by coding a categorical variable into a numeric dummy variable, we can describe the
'linear' relationship between the type of seat and the price of the ticket. Figure 6.1 shows the relationship
between the numeric variable window and the numeric variable price.
Chapter 6.2: Using regression to describe group means
The least-squares regression line goes straight through the two
group means: within each group, the sum of the squared residuals
is at its smallest value when the predicted value equals the group
mean (the least-squares principle).
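The reasoning can be written out as a short derivation. The symbols $b_0$, $b_1$, $\bar{y}_0$ and $\bar{y}_1$ (the mean price in the groups coded 0 and 1) are notation assumed here for illustration.

```latex
\begin{align*}
S(b_0, b_1)
  &= \sum_{i:\,\text{window}_i = 0} \bigl(\text{price}_i - b_0\bigr)^2
   + \sum_{i:\,\text{window}_i = 1} \bigl(\text{price}_i - (b_0 + b_1)\bigr)^2 \\
\intertext{The first sum depends only on $b_0$ and the second only on $b_0 + b_1$.
A sum of squared deviations from a constant is smallest when that constant is the mean, so}
b_0 &= \bar{y}_0, \qquad b_0 + b_1 = \bar{y}_1
  \quad\Longrightarrow\quad b_1 = \bar{y}_1 - \bar{y}_0 .
\end{align*}
```

So the fitted line passes through the point $(0, \bar{y}_0)$ and the point $(1, \bar{y}_1)$.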
When you know the group means, it is very easy to draw the regression line: the intercept is then the
mean for the category coded as 0, and the slope is equal to the mean of the category coded as 1 minus the
mean of the category coded as 0 (i.e. the intercept).
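This rule can be checked numerically. The sketch below uses invented data (not the book's) and compares the intercept and slope computed from the group means with the coefficients that lm() returns.

```r
# Hypothetical data, invented for illustration; not the book's data.
bus <- data.frame(
  window = c(0, 0, 1, 1, 1),
  price  = c(57, 61, 60, 64, 68)
)

# Group means for the category coded 0 and the category coded 1.
mean_0 <- mean(bus$price[bus$window == 0])
mean_1 <- mean(bus$price[bus$window == 1])

# Intercept = mean of group 0; slope = mean of group 1 minus mean of group 0.
intercept_from_means <- mean_0
slope_from_means     <- mean_1 - mean_0

# Compare with the least-squares coefficients.
fit <- lm(price ~ window, data = bus)
same <- isTRUE(all.equal(unname(coef(fit)),
                         c(intercept_from_means, slope_from_means)))
print(same)
```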
Chapter 6.4: Regression analysis using a dummy variable in R
When your independent variable is a categorical variable, the code that you use in R is the same as with a
numeric independent variable. For instance, if you want to predict yield from the treatment group, you
could run the following R code:
data("PlantGrowth")
PlantGrowth %>%