Notes on practicals
Practical 1
You can tell that something is a function in R by the parentheses that follow its name.
If you create a data frame with
dat3a <- as.data.frame(mat3)
from an object whose properties are not correct, the resulting data frame is not correct either.
Therefore, when your objects are a mix of numeric and character data, you should create the
data frame from the data itself
dat3b <- data.frame(V1 = vec1, V2 = vec2)
where
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")
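The difference between the two approaches can be sketched as follows. The notes do not show how mat3 was built, so the cbind() call below is an assumption made for illustration only.

```r
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")

# Assumption: mat3 combines both vectors in one matrix.
# A matrix can hold only one type, so the numbers are coerced to text.
mat3 <- cbind(vec1, vec2)

dat3a <- as.data.frame(mat3)                # inherits the coerced values
dat3b <- data.frame(V1 = vec1, V2 = vec2)   # each column keeps its own type

is.numeric(dat3a[[1]])   # FALSE: the first column is no longer numeric
is.numeric(dat3b$V1)     # TRUE
```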
Factor = a categorical variable with a numerical representation
With the factor function you can change the labels of a factor, for example assign the label 'Utrecht' to the value 1.
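A minimal sketch of labelling a factor. The vector city and the labels other than 'Utrecht' are hypothetical and only serve as an example.

```r
# Hypothetical numeric codes for a categorical variable
city <- c(1, 2, 1, 3)

# Attach labels to the codes: 1 becomes "Utrecht", and so on
city_f <- factor(city,
                 levels = c(1, 2, 3),
                 labels = c("Utrecht", "Amsterdam", "Rotterdam"))

city_f               # shows the labels
as.integer(city_f)   # shows the underlying numerical representation
```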
To get an overview of the dimensions of a dataset (the number of rows and columns), use
dim(boys)
With the head and tail functions you get the first or last 6 cases. This is a quick way of
inspecting your dataset.
Labels for missing data: <NA> (non-numeric data) or NA (numeric data). Both mean 'not available'.
The exclamation mark (!) turns TRUE into FALSE and FALSE into TRUE.
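As a small illustration of how the exclamation mark combines with missing values, here is a sketch on a made-up vector x (not part of the practical data).

```r
x <- c(4, NA, 7)   # hypothetical vector with one missing value

is.na(x)       # FALSE  TRUE FALSE : which values are missing
!is.na(x)      #  TRUE FALSE  TRUE : which values are observed
x[!is.na(x)]   # 4 7               : keep only the observed values
```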
To inspect your data you can use different functions:
The structure function gives you an overview of the measurement levels, the first few values
of each variable, and the class of the variables
str(boys)
The summary function gives you information about the distribution of the numeric variables
and a frequency table for the categorical variables, for all variables at once.
summary(boys)
If you want to work with a single variable in a dataset, you use the dollar sign ($). For
example, the standard deviation of age in the dataset boys:
sd(boys$age)
We cannot calculate a standard deviation without telling R how to deal with the missingness.
The argument
na.rm = TRUE
means remove the missing values. The standard deviation is then calculated on the observed
data only.
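The effect of this argument can be seen on a small made-up vector (the values are hypothetical and not taken from boys).

```r
age <- c(10, 12, NA, 14)   # hypothetical ages with one missing value

sd(age)                 # NA: R does not know how to handle the missing value
sd(age, na.rm = TRUE)   # 2: computed on the three observed values only
```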
If you want to select data based on two variables at once, you need two separate evaluations, combined with &.
mean(subset(boys, age < 15 & reg != "north")$age, na.rm = TRUE)
Within subset you specify your two conditions, and from the resulting subset you then use only the variable age.
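A sketch of the same idea on a small made-up data frame, assuming the second condition is meant to read "region is not north". The object toy and its values are hypothetical.

```r
# Hypothetical data with the same variable names as in the call above
toy <- data.frame(
  age = c(3, 10, 16, 12, NA, 6),
  reg = c("north", "south", "west", "north", "east", "east")
)

# Two evaluations combined with &: younger than 15 AND not in region north
sel <- subset(toy, age < 15 & reg != "north")
sel

# Use only the variable age from the selected rows
mean(sel$age, na.rm = TRUE)
```

Rows for which the condition evaluates to NA (here the row with a missing age) are left out by subset.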
When you load a dataset, you can open its help page with
?mammalsleep
which gives you information about the variable names.
To base each correlation on all completely observed pairs, use the correlation function with
cor(sleepdata, use = "pairwise.complete.obs")
Exclude the categorical columns, for example column one, by using
cor(sleepdata[, -1], use = "pairwise.complete.obs")
However, the correlation matrix has many decimals, which you can deal with using the round
function. You can, for example, round the correlations to two decimals
round(cor(sleepdata[, -1], use = "pairwise.complete.obs"), 2)
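The three steps can be tried out on a small made-up data frame. The object toy and its columns are hypothetical and stand in for sleepdata.

```r
# Hypothetical data: one categorical column and three numeric columns with gaps
toy <- data.frame(
  species = c("a", "b", "c", "d", "e"),
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, NA, 8, 10),
  z = c(5, 3, 4, NA, 1)
)

# Drop column one (categorical), use all complete pairs, round to two decimals
r <- round(cor(toy[, -1], use = "pairwise.complete.obs"), 2)
r
```

With pairwise deletion each correlation is based on the rows in which both variables are observed, so different cells of the matrix can rest on different numbers of cases.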
Convenient functions: any object in the workspace can be saved. The function save.image saves the whole workspace, and save saves the objects you name.
save.image("Practical_X.RData")
save(sleepdata, file = "Sleepdata.RData")
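A minimal sketch of saving a single object and reading it back. It writes to a temporary file so that nothing in the working directory is touched, and the object obj is hypothetical.

```r
tmp <- tempfile(fileext = ".RData")   # temporary file name
obj <- data.frame(a = 1:3)            # hypothetical object to save

save(obj, file = tmp)   # note the = after file
rm(obj)                 # remove the object from the workspace
load(tmp)               # restores obj under its original name
obj
```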
If you want to exclude cases, you can do this by name
exclude <- c("Echidna", "Lesser short-tailed shrew", "Musk shrew")
which <- sleepdata$species %in% exclude
The object which is a logical vector with as many elements as there are rows in the data. If
you index the data with it, you get back only the rows for which it says TRUE. Negating it
with ! gives the rows to keep, so your new dataset without the excluded species would be
sleepdata2 <- sleepdata[!which, ]
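The exclusion step can be sketched on a small made-up data frame. The object toy and its values are hypothetical. Because which is also the name of a base R function, the sketch calls the logical vector to_drop instead; this is only a naming choice.

```r
# Hypothetical data standing in for sleepdata
toy <- data.frame(
  species = c("Echidna", "Cat", "Musk shrew", "Dog"),
  brw = c(3, 25.6, 0.33, 70)
)

exclude <- c("Echidna", "Musk shrew")

# TRUE for every row whose species is in the exclusion list
to_drop <- toy$species %in% exclude
to_drop

# Keep the rows that are NOT flagged
toy2 <- toy[!to_drop, ]
toy2
```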
When plotting your variables, you use ~, which indicates that you want to model something
based on something else. It separates the outcome from the predictor, and plot uses this to
decide what to show on which axis.
plot(brw ~ species, data = sleepdata2)
If you want to find all cases that are more than one standard deviation above the mean, you
take several steps
sd.brw <- sd(sleepdata2$brw)
mean.brw <- mean(sleepdata2$brw)
which <- sleepdata2$brw > (mean.brw + (1 * sd.brw))
as.character(sleepdata2$species[which])
So, you calculate the standard deviation and the mean of brain weight, and then you make a new
object which (this overwrites the object which you created earlier). It indicates the cases with
a brain weight more than one standard deviation above the mean. Finally, you display the
species for which which is TRUE, as character values.
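The same steps on a small made-up data frame, so the result can be checked by hand. The object toy, its values, and the name above are hypothetical.

```r
# Hypothetical brain weights: one case is clearly larger than the rest
toy <- data.frame(
  species = c("a", "b", "c", "d", "e"),
  brw = c(1, 2, 3, 4, 20)
)

sd.brw <- sd(toy$brw)       # about 7.9
mean.brw <- mean(toy$brw)   # 6

# TRUE for cases more than one standard deviation above the mean
above <- toy$brw > (mean.brw + (1 * sd.brw))

as.character(toy$species[above])   # "e"
```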
Practical 2
Objects in R are case-sensitive. This means that
a <- 100
A <- 200
are different objects, each with its own value.
To learn more about the data, use one of the two following help commands
help(nhanes)
?nhanes
To get an overview of the data, use
summary(nhanes)
When you want to explore the missingness in the dataset, you can use the summary command,
or
apply(nhanes, MARGIN = 2, FUN = function(x) sum(is.na(x)))
The code ‘applies’ the function that calculates the sum (sum()) over the missings (is.na) on a
set of data (x). The nice thing about apply is that you can apply functions on two-dimensional
objects. In this case you execute a function that calculates the sum of missings (FUN =
function(x) sum(is.na(x))) over the columns (MARGIN = 2) of object nhanes. If you would
change MARGIN = 2 to MARGIN = 1, you would do the same, but over the rows of nhanes.
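The difference between the two margins can be sketched on a small made-up data frame. The object toy is hypothetical and only borrows variable names from nhanes.

```r
# Hypothetical data with missing values
toy <- data.frame(
  age = c(1, 2, 3),
  bmi = c(NA, 22, NA),
  chl = c(NA, NA, 200)
)

# MARGIN = 2: number of missings per column (age 0, bmi 2, chl 2)
col_miss <- apply(toy, MARGIN = 2, FUN = function(x) sum(is.na(x)))
col_miss

# MARGIN = 1: number of missings per row (2, 1, 1)
row_miss <- apply(toy, MARGIN = 1, FUN = function(x) sum(is.na(x)))
row_miss
```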
The function colMeans() calculates the mean of numerical columns
colMeans(nhanes, na.rm=TRUE)
However, you have to specify how you would like to handle the missing values. By using
na.rm=TRUE
you tell R that you would like to remove (rm) the missings (na).
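A short sketch of what happens with and without this argument, on a made-up data frame (toy and its values are hypothetical).

```r
toy <- data.frame(
  age = c(1, 2, 3),
  bmi = c(NA, 22, 26)
)

colMeans(toy)                 # age 2, bmi NA: the missing value propagates
colMeans(toy, na.rm = TRUE)   # age 2, bmi 24: mean of the observed values
```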
To determine how many cases would be available if only the complete cases were used, there
are multiple ways
1 You could look at the data and determine the number of completely observed cases
2 You could use the missing data pattern to deduce the number of cases for which the pattern
1 1 1 1 (everything observed) holds.
3 You could use code to determine the number of cases (rows) that have no missings. For
example:
nrow(na.omit(nhanes))
The function na.omit performs listwise deletion on the object you apply it to. In other words,
it removes any incomplete row.
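Counting the complete cases can be sketched on a small made-up data frame. The object toy is hypothetical. The call to complete.cases() is an alternative that is not mentioned in the notes; it is shown only as a cross-check.

```r
# Hypothetical data: rows 1 and 4 are completely observed
toy <- data.frame(
  age = c(1, 2, 3, 1),
  bmi = c(22, NA, 25, 30),
  chl = c(180, 190, NA, 200)
)

na.omit(toy)         # the data after listwise deletion
nrow(na.omit(toy))   # 2 complete cases

# Cross-check: complete.cases() flags fully observed rows with TRUE
sum(complete.cases(toy))   # 2
```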
To check the missing data pattern, use
md.pattern(nhanes)
Looking at the missing data pattern is always useful (but may be difficult for datasets with
many variables). It can give you an indication on how much information is missing and how
the missingness is distributed.
If you want to create a missingness indicator that shows whether a variable is missing or not,
you create a new vector
rbmi <- is.na(nhanes$bmi)
rbmi
You create a new vector rbmi (you can see it as a variable) that indicates whether bmi is
missing (TRUE) or not missing (FALSE), with the same length as the old variable.
To test whether the missingness in one variable depends on another variable, perform a t-test with
t.test(age ~ rbmi, data=nhanes)
Here you test whether the missingness in bmi depends on age.
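The indicator and the t-test can be sketched together on a small made-up data frame. The object toy and its values are hypothetical, so the outcome says nothing about nhanes.

```r
# Hypothetical data with the same variable names as nhanes
toy <- data.frame(
  age = c(1, 2, 3, 1, 2, 3, 2, 1),
  bmi = c(22, NA, NA, 25, 27, NA, 21, 30)
)

# Missingness indicator: TRUE where bmi is missing, same length as bmi
rbmi <- is.na(toy$bmi)
rbmi

# Compare mean age between cases with observed and with missing bmi
res <- t.test(age ~ rbmi, data = toy)
res$estimate   # mean age in both groups
res$p.value
```

The formula finds age in toy and rbmi in the workspace, which mirrors the call used on nhanes.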
With a bivariate dataset you can calculate the correlation between the variables with the
following code
cor(data)
With partially incomplete data you can use ad hoc imputation methods to impute the missing
values.
First you need to evaluate the means and correlation of the incomplete data set.