PRACTICAL 3
SUMMARY SLIDES
Automation, add-on packages and reshaping
AUTOMATION OF REPETITIVE ANALYSES
Dataset
• Genetic analysis of age-related hearing impairment
• Phenotype : Z-score
- Standardized measure for hearing quality
- Lower Z-score = better
• Association between Z-score and genotype
- SNP genotype : aa, ab, bb
• ANOVA : SNP GT = categorical
• Regression : SNP GT = 0,1,2
• Find SNP associated with phenotype
Two ways to analyse this:
1) Treat genotype as a categorical variable and do a one-way ANOVA
2) Treat genotype as a number: count the number of rare alleles (aa = 0, ab = 1, bb = 2, so bb has 2 rare alleles)
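The two codings can be compared on a small simulated data set. This is only a sketch: the data are random, and the names genotype, n.rare and zscore are made up for illustration and do not come from the practical's dataset.

```r
# Simulated data; all names and values are illustrative assumptions.
set.seed(1)
genotype <- factor(sample(c("aa", "ab", "bb"), 100, replace = TRUE),
                   levels = c("aa", "ab", "bb"))

# Count of rare (b) alleles: aa = 0, ab = 1, bb = 2
n.rare <- as.numeric(genotype) - 1
zscore <- 0.3 * n.rare + rnorm(100)

# 1) Genotype as a categorical variable: one-way ANOVA
anova.fit <- lm(zscore ~ genotype)
p.anova <- anova(anova.fit)[1, 5]   # row 1 = genotype, column 5 = Pr(>F)

# 2) Genotype as a number (0, 1, 2): simple linear regression
reg.fit <- lm(zscore ~ n.rare)
p.reg <- anova(reg.fit)[1, 5]

print(c(anova = p.anova, regression = p.reg))
```

The ANOVA uses 2 degrees of freedom for genotype (one per extra category), whereas the regression uses 1 because it assumes the effect changes linearly with the number of rare alleles.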
Repetitive analysis
• Regression :
- one dependent numeric variable (Y)
- several independent (X) variables
• Regress Y on all separate X-variables
- → several simple linear regressions/ANOVAs, one per X-variable
Y      Xvar1  Xvar2  Xvar3  Xvar4
5.71   aa     ab     aa     ab
-0.93  aa     ab     ab     aa
2.58   ab     bb     bb     aa
The Xvar columns contain the genotypes
In the regression, Y (the numeric phenotype) is the dependent variable (outcome)
We want to know the p-values: which associations are significant?
Don't run every analysis (ANOVA) by hand → there are smarter ways to do this in R
R is useful for repetitive analysis
For the first X variable
myModel<- lm(Y ~ Xvar1)
For the second X variable
myModel<- lm(Y ~ Xvar2)
For the third X variable
myModel<- lm(Y ~ Xvar3)
• 1 analysis, with the X variable selected by column index:
myModel<- lm(Y ~ allXvars[,1])
For the first X variable
myModel<- lm(Y ~ allXvars[,1])
For the second X variable
myModel<- lm(Y ~ allXvars[,2])
For the third X variable
myModel<- lm(Y ~ allXvars[,3])
Could also run
i<-1
myModel<- lm(Y ~ allXvars[ ,i])
i<-2
myModel<- lm(Y ~ allXvars[ ,i])
i<-3
myModel<- lm(Y ~ allXvars[ ,i])
Then you have the same formula every time and only i changes
You can ask R to let i go through all the values
For-loop
• Let i run through all values from 1 to n, for all commands within the {curly braces}
for(i in 1:3) {
myModel<- lm(Y ~ allXvars[ ,i])
}
The values that i needs to run through go in the parentheses; the commands to repeat go between the curly braces
The commands are executed for i = 1, i = 2 and i = 3
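A runnable sketch of the for-loop, assuming simulated data. Y and allXvars follow the names used in the slides, but their contents here are random and purely illustrative.

```r
# Simulated phenotype and genotypes; values are illustrative assumptions.
set.seed(2)
n <- 30
Y <- rnorm(n)
gt <- rep(c("aa", "ab", "bb"), length.out = n)  # every genotype occurs
allXvars <- data.frame(
  Xvar1 = sample(gt),
  Xvar2 = sample(gt),
  Xvar3 = sample(gt)
)

# i takes the values 1, 2, 3 in turn; the commands between the
# curly braces are executed once for each value
for (i in 1:3) {
  myModel <- lm(Y ~ allXvars[, i])
  cat("Step", i, "fitted a model for", colnames(allXvars)[i], "\n")
}
```

After the loop, myModel only holds the model from the last step (i = 3), because each step overwrites the previous one. Writing seq_len(ncol(allXvars)) instead of 1:3 would let the loop adapt to any number of columns.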
Parameter estimates in a loop
• Suppose in each step you estimate a parameter or a p-value
• First create empty vector to save the output from each step
p.value<-rep(NA,3)
for(i in 1:3) {
myModel<-lm(Y ~ allXvars[ ,i])
p.value[i]<-anova(myModel)[1,5]
}
We need the p-values
We have to extract the p-value from each model without overwriting the previous p-value
So each p-value has to be extracted and stored somewhere
First create an empty vector
rep = repeat
Before the loop starts, create an empty vector with 3 places (filled with NA)
In the first iteration the p-value is assigned to the first place of the vector, in the second iteration to the second place, and so on
When R has finished the loop, the vector is filled and contains the p-values of all iterations
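The whole pattern (empty vector, loop, store one p-value per step) could look as follows on simulated data. The data, the use of ncol() to set the vector length, and the 0.05 cut-off are assumptions added for illustration.

```r
# Simulated phenotype and genotypes; values are illustrative assumptions.
set.seed(3)
n <- 60
Y <- rnorm(n)
gt <- rep(c("aa", "ab", "bb"), length.out = n)
allXvars <- data.frame(Xvar1 = sample(gt),
                       Xvar2 = sample(gt),
                       Xvar3 = sample(gt))

# Empty vector with one place per X variable, created before the loop
n.vars <- ncol(allXvars)
p.value <- rep(NA, n.vars)

for (i in 1:n.vars) {
  myModel <- lm(Y ~ allXvars[, i])
  # Row 1 of the ANOVA table is the X variable, column 5 is Pr(>F)
  p.value[i] <- anova(myModel)[1, 5]
}

# Label the results so each p-value can be traced back to its column
names(p.value) <- colnames(allXvars)
print(p.value)

# Names of the variables with p < 0.05 (cut-off chosen for illustration)
significant <- names(p.value)[p.value < 0.05]
```

Creating the vector before the loop matters: inside the loop, only place i is filled in, so earlier results are kept.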
Automation with new function
• Piece of script, to be carried out multiple times
Data<- read.table("input_1.txt")
p.value<-t.test(…)
write.table(p.value,file="output_1.txt")
• 3 similar input files
Data<- read.table("input_1.txt")
p.value<-t.test(…)
write.table(p.value,file="output_1.txt")
Data<- read.table("input_2.txt")
p.value<-t.test(…)
write.table(p.value,file="output_2.txt")
Data<- read.table("input_3.txt")
p.value<-t.test(…)
write.table(p.value,file="output_3.txt")
• Wrap the piece of code into a new function
- Give name to function
- List necessary arguments
- Use argument names in the code
doMyAnalysis<- function(inputfile,outputfile) {
Data<- read.table(inputfile)
p.value<-t.test(…)
write.table(p.value,file=outputfile)
}
To run the new function
• Initialize (define) the new function
- Select the function code and run it
- Repeat this each time you restart R
- Running the definition gives no output
• Run
doMyAnalysis("input_1.txt","output_1.txt")
doMyAnalysis("input_2.txt","output_2.txt")
doMyAnalysis("input_3.txt","output_3.txt")
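A self-contained sketch of the same idea that can be run without the course files. The slides leave the arguments of t.test unspecified, so the test used here (a two-sample t-test on hypothetical columns x and y) is an assumption, as are the simulated input files written to a temporary directory.

```r
# Hypothetical analysis: two-sample t-test on columns 'x' and 'y'.
# The actual t.test call in the slides is not specified.
doMyAnalysis <- function(inputfile, outputfile) {
  Data <- read.table(inputfile, header = TRUE)
  p.value <- t.test(Data$x, Data$y)$p.value
  write.table(p.value, file = outputfile)
  invisible(p.value)  # also return the p-value, without printing it
}

# Create three small simulated input files in a temporary directory
set.seed(4)
workdir <- tempdir()
for (k in 1:3) {
  sim <- data.frame(x = rnorm(10), y = rnorm(10, mean = 1))
  write.table(sim, file = file.path(workdir, paste0("input_", k, ".txt")),
              row.names = FALSE)
}

# Run the function once per input file
for (k in 1:3) {
  doMyAnalysis(file.path(workdir, paste0("input_", k, ".txt")),
               file.path(workdir, paste0("output_", k, ".txt")))
}
```

Because the file names differ only in their number, paste0() can build them inside a loop, so the function call does not have to be typed three times.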
Further automation
• Using a list object
- Consists of other objects (elements)
- Individual elements accessed by double square brackets
list.object[[i]]
• Here
- Put input files and/or output files in a list
- 1 list object, containing the 3 input data frames
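A short illustration of how a list holds objects of different types and how double square brackets retrieve one element. The contents of list.object are made up for this example.

```r
# A list can hold elements of different types (contents are illustrative)
list.object <- list(1:3, "some text", data.frame(a = 1:2, b = c("x", "y")))

# Double square brackets return the element itself
second <- list.object[[2]]       # a character string
third  <- list.object[[3]]       # a data frame

# Single square brackets return a (shorter) list instead
sub.list <- list.object[2]       # a list of length 1

# The index can be a variable, which is what makes lists loop-friendly
i <- 3
nrow(list.object[[i]])
```

The distinction matters in practice: functions that expect a data frame need list.object[[i]], not list.object[i].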
Combine list and for-loop
• First create empty list
Mylist<-vector("list",n.elements)
• Read in the 3 input files
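One possible way to combine the empty list with a for-loop is sketched below. The input files are simulated and written to a temporary directory so the example runs on its own; their contents and the header = TRUE setting are assumptions, not taken from the practical.

```r
n.elements <- 3
workdir <- tempdir()

# Create three small simulated input files (illustrative content)
set.seed(6)
for (i in 1:n.elements) {
  write.table(data.frame(x = rnorm(5)),
              file = file.path(workdir, paste0("input_", i, ".txt")),
              row.names = FALSE)
}

# First create the empty list, then fill one element per step
Mylist <- vector("list", n.elements)
for (i in 1:n.elements) {
  infile <- file.path(workdir, paste0("input_", i, ".txt"))
  Mylist[[i]] <- read.table(infile, header = TRUE)
}

# Each element is now a data frame
str(Mylist[[1]])
```

This mirrors the p-value vector approach: the container is created before the loop and element i is filled in at step i.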