ALL ABOUT ISYE 6501 WEEK 8 COMPLETE SOLUTION
Question 11.1
Using the crime data set uscrime.txt from Questions 8.2, 9.1, and 10.1, build a regression model using:
1. Stepwise regression
2. Lasso
3. Elastic net
For Parts 2 and 3, remember to scale the data first – otherwise, the regression coefficients will be on
different scales and the constraint won’t have the desired effect.
For Parts 2 and 3, use the glmnet function in R.
Notes on R:
• For the elastic net model, what we called λ in the videos, glmnet calls "alpha"; you can get a range of
results by varying alpha from 1 (lasso) to 0 (ridge regression) [and, of course, other values of alpha in
between].
• In a function call like glmnet(x, y, family="mgaussian", alpha=1) the predictors x need to be in R's matrix
format, rather than data frame format. You can convert a data frame to a matrix using as.matrix – for
example, x <- as.matrix(data[, 1:(n-1)]) (note the parentheses: in R, 1:n-1 evaluates as (1:n) - 1).
• Rather than specifying a value of T, glmnet returns models for a variety of values of T.
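As a concrete illustration of the matrix-format note (using a small made-up data frame, since no specific data is assumed here):

```r
library(glmnet)

# Hypothetical data frame: 4 predictor columns plus a response in the last column
set.seed(1)
data <- data.frame(matrix(rnorm(50 * 5), ncol = 5))
n <- ncol(data)

# Parentheses matter: data[, 1:n-1] would evaluate as (1:n) - 1 and error out
x <- as.matrix(data[, 1:(n - 1)])   # predictors in matrix format
y <- data[, n]                      # response vector

fit <- glmnet(x, y, family = "gaussian", alpha = 1)  # alpha = 1 is the lasso
```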
Data Analysis –
The uscrime dataset records the number of offenses per 100,000 population, a continuous response, along with a
set of possible "predictors" –
# Variable  Description
# M         percentage of males aged 14–24 in total state population
# So        indicator variable for a southern state
# Ed        mean years of schooling of the population aged 25 years or over
# Po1       per capita expenditure on police protection in 1960
# Po2       per capita expenditure on police protection in 1959
# LF        labor force participation rate of civilian urban males in the age-group 14–24
# M.F       number of males per 100 females
# Pop       state population in 1960 in hundred thousands
# NW        percentage of nonwhites in the population
# U1        unemployment rate of urban males 14–24
# U2        unemployment rate of urban males 35–39
# Wealth    median value of transferable assets or family income
# Ineq      income inequality: percentage of families earning below half the median income
# Prob      probability of imprisonment: ratio of number of commitments to number of offenses
# Time      average time in months served by offenders in state prisons before their first release
# Crime     crime rate: number of offenses per 100,000 population in 1960
First, to understand the data, I loaded it into a table, looked at the data summary, and examined a box plot to
check for possible outliers. The three highest Crime values (1993, 1969, and 1674) fell outside the whiskers of the
boxplot. Using the Grubbs test we could possibly remove these outliers, but since this check was mostly for
discovery, I have not removed any data points for this assignment.
Next, I looked at the correlation matrix to check whether any pair of variables is correlated. There is a strong
linear correlation between Po1 and Po2 (correlation coefficient 0.99), and Wealth and Ineq are closely negatively
correlated (coefficient -0.88). I also checked scatter plots of the predictors against Crime to get a visual idea of
the relationships, which suggested that not all of them would be significant in our model.
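The exploratory steps described above can be sketched as follows; this assumes uscrime.txt sits in the working directory and that the outliers package (for the Grubbs test) is installed:

```r
library(outliers)  # for grubbs.test; assumed installed

uscrime <- read.table("uscrime.txt", header = TRUE)

summary(uscrime)        # five-number summaries of all 16 columns
boxplot(uscrime$Crime)  # the top three Crime values sit above the upper whisker

grubbs.test(uscrime$Crime)  # tests whether the largest Crime value is an outlier

# Correlation matrix: Po1/Po2 around 0.99, Wealth/Ineq around -0.88
round(cor(uscrime), 2)

# Scatter plots of selected predictors against Crime
pairs(uscrime[, c("Po1", "Po2", "Wealth", "Ineq", "Crime")])
```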
I. Stepwise regression –
The underlying assumption of stepwise regression is that the predictor variables are not highly correlated. In
each step of the process, a variable is added to or removed from the set of predictors. If we start with zero
predictors and keep adding, it is forward selection; if we start with all predictors and keep removing, it is
backward selection. In the R code I performed backward selection for factors on the scaled data (except for the
binary column So). This process suggested a model with 8 factors:
Step: AIC=503.93
.outcome ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq +
In the next step I used these 8 variables to build a regression model to check whether they are indeed significant.
At this stage the adjusted R² was 0.74, but not all factors were significant. I repeated this step twice, removing
M.F and U1 from the initial selection of predictors, and used cross-validation to evaluate the final model. With 6
factors the cross-validated R² was 0.66, not much lower than the initial suggestion of a model with 8 variables.
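A minimal sketch of the backward selection, assuming uscrime.txt in the working directory and a scaling scheme that leaves the binary So column untouched (the exact AIC path may differ from the run reported here):

```r
library(MASS)

uscrime <- read.table("uscrime.txt", header = TRUE)

# Scale every column except the binary indicator So (col 2) and the response Crime (col 16)
uscrime_scaled <- as.data.frame(scale(uscrime[, -c(2, 16)]))
uscrime_scaled$So    <- uscrime$So
uscrime_scaled$Crime <- uscrime$Crime

full_model <- lm(Crime ~ ., data = uscrime_scaled)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step printout
step_model <- stepAIC(full_model, direction = "backward", trace = 0)
summary(step_model)
```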
II. Lasso –
The least absolute shrinkage and selection operator is a shrinkage and selection method for linear regression. It
minimizes the usual sum of squared errors, subject to a bound on the sum of the absolute values of the coefficients.
For our purpose of predictor selection with lasso, I used –
I plotted the cross-validated MSE against lambda, as well as the number of predictor variables against lambda. I
then found the lambda value with the smallest mean cross-validated error and looked at the coefficients of each
predictor at that minimum lambda. This process showed that there are 10 possible variables that might be
significant, and using these I created my first model with an R² value of 0.74.
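The selection step described above might look like the following with cv.glmnet (the five-fold setting is an assumption, and the exact variables retained vary with the random folds):

```r
library(glmnet)

uscrime <- read.table("uscrime.txt", header = TRUE)

# Scale the predictors, leaving the binary indicator So unscaled
x <- as.matrix(cbind(scale(uscrime[, -c(2, 16)]), So = uscrime$So))
y <- uscrime$Crime

set.seed(15)
cv_lasso <- cv.glmnet(x, y, alpha = 1, family = "gaussian",
                      nfolds = 5, type.measure = "mse")

plot(cv_lasso)                    # cross-validated MSE vs log(lambda)
cv_lasso$lambda.min               # lambda with the smallest mean CV error
coef(cv_lasso, s = "lambda.min")  # nonzero rows are the selected predictors
```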
#fit a model with the variables that have nonzero coefficients
mod_lasso = lm(Crime ~ So + M + Ed + Po1 + LF + M.F + NW + U2 + Ineq + Prob, data = uscrime_scaled)
summary(mod_lasso)
But not all the predictors were significant, so I recreated the model with only the following variables; this
time the adjusted R² was 0.73.
#remove factors with p > 0.05
mod_lasso_2 = lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = uscrime_scaled)
summary(mod_lasso_2)
Before removing the 4 variables, the cross-validated R² was 0.58; after removing the 4 non-significant
factors it went up to 0.64.
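One way to obtain cross-validated R² figures like those quoted above is a manual k-fold loop (a sketch; the fold assignment is random, so the exact numbers will differ):

```r
uscrime <- read.table("uscrime.txt", header = TRUE)
uscrime_scaled <- as.data.frame(scale(uscrime[, -c(2, 16)]))
uscrime_scaled$So    <- uscrime$So
uscrime_scaled$Crime <- uscrime$Crime

set.seed(15)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(uscrime_scaled)))

preds <- numeric(nrow(uscrime_scaled))
for (i in 1:k) {
  # Fit on k-1 folds, predict on the held-out fold
  fit <- lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob,
            data = uscrime_scaled[folds != i, ])
  preds[folds == i] <- predict(fit, newdata = uscrime_scaled[folds == i, ])
}

# Cross-validated R^2: 1 - SSE/SST over the held-out predictions
sse <- sum((uscrime_scaled$Crime - preds)^2)
sst <- sum((uscrime_scaled$Crime - mean(uscrime_scaled$Crime))^2)
1 - sse / sst
```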
III. Elastic Net –
The elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of
the lasso and ridge methods. For our analysis, I ran a loop over alpha values between 0 and 1 in steps of 0.1 and
noted the R² for each. The best value of alpha was 0.9; applying it, the cv.glmnet method calculated the
coefficients for each variable. The method gave the following predictors, and a regression model using them had an
R² of 0.72, but with a number of non-significant coefficients. The cross-validated R² using all of these predictors
came to only 0.485607.
#use the predictors from the process
mod_Elastic_net = lm(Crime ~ So + M + Ed + Po1 + M.F + Pop + NW + U1 + U2 + Wealth + Ineq + Prob, data = uscrime_scaled)
summary(mod_Elastic_net)
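The alpha search described above can be sketched as a loop over cv.glmnet fits (the five-fold setting and the use of dev.ratio as the R² measure are assumptions):

```r
library(glmnet)

uscrime <- read.table("uscrime.txt", header = TRUE)
x <- as.matrix(cbind(scale(uscrime[, -c(2, 16)]), So = uscrime$So))
y <- uscrime$Crime

set.seed(15)
alphas <- seq(0, 1, by = 0.1)
r2 <- numeric(length(alphas))

for (j in seq_along(alphas)) {
  cv_fit <- cv.glmnet(x, y, alpha = alphas[j], family = "gaussian",
                      nfolds = 5, type.measure = "mse")
  # dev.ratio at lambda.min is glmnet's deviance-based R^2 analogue
  r2[j] <- cv_fit$glmnet.fit$dev.ratio[which(cv_fit$lambda == cv_fit$lambda.min)]
}

cbind(alpha = alphas, R2 = round(r2, 3))  # pick the alpha with the highest R^2
```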
Comparison –
Based on the limited data we had, stepwise regression gave us the fewest predictors with good R² and adjusted R²
values. Although even after stepwise regression we needed to discard some variables based on their p-values, it
still did a better job than the other two methods: elastic net chose 9 variables and lasso chose 10 to start with.
R code -
rm(list = ls())
set.seed(15)
library(MASS)
library(glmnet)
## Loading required package: Matrix