Table of Contents
WEEK 0: PREPARATION BEFORE THE COURSE STARTS (22/08-28/08)
DataCamp – Introduction to R
Chapter 1: Intro to Basics
Chapter 2: Vectors
Chapter 3: Matrices
Chapter 4: Factors
Chapter 5: Data frames
Chapter 6: Lists
Table: All Functions from the Chapters
WEEK 1 – LECTURE: GETTING STARTED WITH R (01/09)
Tilburg Science Hub. Professionalize your Team Work Using Scrum / A Guide to Scrum for Researchers
How we professionalized our teamwork
Scrum in a nutshell
Why Scrum is useful and how to make it a success
In-Class Tutorial: Getting to Know R
After-Class Tutorial: R for Social Scientists
Chapter 1: Before we Start
Chapter 2: Introduction to R
Chapter 3: Starting with Data
Chapter 4: Data Wrangling with dplyr and tidyr
WEEK 2 – TUTORIAL: PROJECT MANAGEMENT AND VERSION CONTROL (08/09)
The GitHub Training Team. Introduction to GitHub
DataCamp – Introduction to Shell
Chapter 1: Manipulating files and directories
Tilburg Science Hub. Principles of Project Setup and Workflow Management
Project Setup Overview
Pipelines and Project Components
Data Management and Directory Structure
Automating your Pipeline
Documenting Datasets
Documenting Source Code and Pipeline Workflows
Versioning using Git and GitHub
Collaborating using GitHub
Checklist to Audit Data- and Computation-intensive Projects
In-Class Tutorial: GitHub
After-Class Tutorial: Version Control
1. Getting started with version control using Git and GitHub
2. The end-to-end Git workflow
3. Advanced Git Workflows
WEEK 3 – TUTORIAL: DATA EXPLORATION USING RMARKDOWN (15/09)
DataCamp – Intermediate R
Chapter 1: Conditionals and Control Flow
Chapter 2: Loops
Chapter 3: Functions
Table: All Functions from the Chapters
R for Social Scientists
Chapter 6: Getting Started with R Markdown
In-Class Tutorial: Exploring and Auditing New Data with RMarkdown
After-Class Tutorial: Data Exploration in R
WEEK 4 – TUTORIAL: ENGINEERING DATASETS (22/09)
DataCamp – Introduction to the Tidyverse
Chapter 1: Data Wrangling
Chapter 3: Grouping and summarizing
Table: All Functions from the Chapters
DataCamp – Cleaning Data in R
Chapter 1: Common Data Problems
Chapter 2: Categorical and Text Data
Table: All Functions from the Chapters
DataCamp – Joining Data with dplyr
Chapter 1: Joining Tables
Chapter 2: Left and Right Joins
Table: All Functions from the Chapters
Opidi, A. (2019, September 19). Solving Data Challenges in Machine Learning With Automated Tools. TOPBOTS
Introduction
The data preparation process
1. Data collection
2. Data preprocessing
3. Data transformation
Data preparation challenges
Solutions to accelerate data preparation
In-Class Tutorial: Engineering Data Sets
WEEK 5 – TUTORIAL: PIPELINE AUTOMATION (28/09)
Video: This is why you should automate the pipeline of your research project. (4:54)
Video: Four steps to automate & make reproducible your empirical research project. (12:27)
Tilburg Science Hub. Use Makefiles to Re-Run Your Code
Overview
Code
Advanced Use Cases
In-Class Tutorial: Pipeline Building and Automation
After-Class Exercises: Pipeline Building and Automation
WEEK 0: PREPARATION BEFORE THE COURSE STARTS (22/08-28/08)
Literature
DataCamp – Introduction to R:
• Chapter 1: Intro to Basics
• Chapter 2: Vectors
• Chapter 3: Matrices
• Chapter 4: Factors
• Chapter 5: Data frames
• Chapter 6: Lists
DataCamp – Introduction to R
Chapter 1: Intro to Basics
R works with numerous data types:
• Decimal values like 4.5 are called numerics.
• Whole numbers like 4 are called integers. Integers are also numerics.
• Boolean values (TRUE or FALSE) are called logical.
• Text (or string) values are called characters.
You can check the data type of a variable with the class() function.
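The data types above can be checked with a minimal sketch like the following. The variable names are hypothetical and chosen only for illustration. One detail worth knowing: R stores a bare whole number such as 4 as a numeric, so the L suffix is used here to create a true integer.

```r
# Hypothetical variable names, chosen for illustration only.
my_numeric   <- 4.5
my_integer   <- 4L           # the L suffix makes this an integer; a bare 4 is stored as numeric
my_logical   <- TRUE
my_character <- "some text"

class(my_numeric)     # "numeric"
class(my_integer)     # "integer"
class(my_logical)     # "logical"
class(my_character)   # "character"

# Integers also count as numerics:
is.numeric(my_integer)   # TRUE
```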
Chapter 2: Vectors
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other
words, a vector is a simple tool to store data. In R, you create a vector with the combine function c().
You place the vector elements separated by a comma between the parentheses.
You can give a name to the elements of a vector with the names() function.
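As a small illustration of c() and names(), the sketch below reuses the poker values from the function table in this summary. The day names are an assumption added for the example.

```r
# Create a vector with the combine function c().
poker_vector <- c(140, -50, 20, -120, 240)

# Attach a name to each element (day names are assumed for illustration).
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

poker_vector          # prints the values labelled with the day names
sum(poker_vector)     # 230
mean(poker_vector)    # 46
```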
The (logical) comparison operators known to R are:
• < for less than.
• > for greater than.
• <= for less than or equal to.
• >= for greater than or equal to.
• == for equal to each other.
• != for not equal to each other.
To select elements of a vector, you can use square brackets. For example, to select the first element of
the vector, you add [1] to the vector name.
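The comparison operators and square-bracket selection can be combined, as in this short sketch (the vector values are taken from the function table; winning_days is a hypothetical name). Note that R starts counting at 1, not 0.

```r
poker_vector <- c(140, -50, 20, -120, 240)

# A comparison is applied element by element and returns a logical vector.
winning_days <- poker_vector > 0
winning_days                        # TRUE FALSE TRUE FALSE TRUE

poker_vector[1]                     # first element: 140
poker_vector[c(2, 4)]               # second and fourth elements: -50 -120
poker_vector[winning_days]          # only the positive values: 140 20 240
poker_vector[poker_vector != 20]    # every element not equal to 20
```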
Chapter 3: Matrices
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged
into a fixed number of rows and columns.
Similar to vectors, you can add names for the rows and the columns of a matrix, using rownames() and
colnames(), respectively.
Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix.
Some examples:
• my_matrix[1,2] selects the element at the first row and second column.
• my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.
• my_matrix[,1] selects all elements of the first column.
• my_matrix[1,] selects all elements of the first row.
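The selection patterns above can be tried on a small matrix. This is only a sketch: the 3x3 matrix and the row and column names are assumptions made for the example.

```r
# Build a 3x3 matrix filled row by row, then name its rows and columns.
my_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)
rownames(my_matrix) <- c("r1", "r2", "r3")
colnames(my_matrix) <- c("c1", "c2", "c3")

my_matrix[1, 2]        # element at row 1, column 2: 2
my_matrix[, 1]         # all elements of the first column: 1 4 7
my_matrix[1, ]         # all elements of the first row: 1 2 3
my_matrix[1:2, 2:3]    # sub-matrix with rows 1-2 and columns 2-3

rowSums(my_matrix)     # 6 15 24
colSums(my_matrix)     # 12 15 18
```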
Chapter 4: Factors
The term factor refers to a statistical data type used to store categorical variables. The difference
between a categorical variable and a continuous variable is that a categorical variable can only take
values from a limited number of categories. A continuous variable, on the other hand, can correspond
to an infinite number of values.
There are two types of categorical variables: a nominal categorical variable and an ordinal categorical
variable. A nominal variable is a categorical variable without an implied order. This means that it is
impossible to say that ‘one is worth more than the other’. In contrast, ordinal variables do have a natural
ordering.
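The difference between nominal and ordinal variables shows up in how a factor is created. The sketch below uses made-up example data: the animal names and temperature labels are assumptions for illustration.

```r
# Nominal: the categories have no implied order.
animals <- factor(c("Elephant", "Giraffe", "Donkey", "Horse"))

# Ordinal: the levels argument fixes the order from low to high.
temperature <- factor(c("High", "Low", "High", "Low", "Medium"),
                      ordered = TRUE,
                      levels = c("Low", "Medium", "High"))

levels(animals)                    # alphabetical by default
temperature[1] > temperature[2]    # TRUE: "High" ranks above "Low"
summary(temperature)               # counts per level: Low 2, Medium 1, High 2
```

Comparing two elements with > only works for ordered factors. For a nominal factor such as animals, R returns NA with a warning.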
Chapter 5: Data frames
A data frame has the variables of a dataset as columns and the observations as rows. You construct a
data frame with the data.frame() function.
Similar to vectors and matrices, you select elements from a data frame with the help of square brackets
[ ].
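A minimal sketch of building a data frame and selecting from it. The data are hypothetical, loosely modelled on the planets_df example that appears in the function table. The column names and values are assumptions.

```r
# Hypothetical data: each column is a variable, each row an observation.
planets_df <- data.frame(
  name     = c("Mercury", "Earth", "Saturn"),
  diameter = c(0.382, 1.000, 9.449),
  rings    = c(FALSE, FALSE, TRUE)
)

planets_df[1, 2]         # row 1, column 2: 0.382
planets_df[2, ]          # all variables of the second observation
planets_df[, "name"]     # a whole column, selected by name
planets_df$diameter      # the $ shortcut selects a column as well

# Keep only the planets with rings, then sort all rows by diameter.
subset(planets_df, subset = rings)
planets_df[order(planets_df$diameter, decreasing = TRUE), ]
```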
Chapter 6: Lists
A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an
ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even
required that these objects are related to each other in any way.
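To illustrate, the sketch below gathers three unrelated objects in one list. The object and component names are hypothetical. Selecting with double brackets or $ is standard R, although this summary does not cover it.

```r
# Three unrelated objects (names are assumptions for illustration).
my_vector <- 1:5
my_matrix <- matrix(1:4, nrow = 2)
my_df     <- data.frame(x = c(1, 2), y = c("a", "b"))

# Gather them under one name, labelling each component.
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)

length(my_list)      # 3 components
my_list[["vec"]]     # select a component by name with double brackets
my_list$mat          # the $ shortcut works for named components
my_list[[3]]         # select a component by position
```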
Table: All Functions from the Chapters
| Ch. | Function | Description | Example |
| --- | --- | --- | --- |
| 1 | class() | Check the data type of a variable. | class(my_integer) |
| 2 | c() | Create a vector. | poker_vector <- c(140, -50, 20, -120, 240) |
| 2 | names() | Assign a name to the elements of a vector. | some_vector <- c("John Doe", "poker player"); names(some_vector) <- c("Name", "Profession") |
| 2 | sum() | Calculates the sum of all elements of a vector. | total_poker <- sum(poker_vector) |
| 2 | mean() | Calculates the average of the values. | mean(poker_start) |
| 3 | matrix() | Construct a matrix. | matrix(1:9, byrow = TRUE, nrow = 3) |
| 3 | rownames() | Name rows of a matrix. | rownames(my_matrix) <- row_names_vector |
| 3 | colnames() | Name columns of a matrix. | colnames(my_matrix) <- col_names_vector |
| 3 | rowSums() | Calculates the totals for each row of a matrix and stores them in a new vector. | rowSums(my_matrix) |
| 3 | colSums() | Calculates the totals for each column of a matrix and stores them in a new vector. | colSums(my_matrix) |
| 3 | cbind() | Merge matrices and/or vectors together by column. | big_matrix <- cbind(matrix1, matrix2, vector1) |
| 3 | rbind() | Merge matrices and/or vectors together by row. | big_matrix <- rbind(matrix1, matrix2, vector1) |
| 3 | ls() | Check out the contents of the workspace. | ls() |
| 4 | factor() | Used to encode a vector as a factor. | factor(some_vector, ordered = TRUE, levels = c("lev1", "lev2", "lev3")) |
| 4 | levels() | Change the names of factor levels. | levels(factor_vector) <- c("name1", "name2") |
| 4 | summary() | Gives you a quick overview of the contents of a variable. | summary(my_var) |
| 5 | head() | Shows the first observations of a data frame. | head(df) |
| 5 | tail() | Shows the last observations in your dataset. | tail(df) |
| 5 | str() | Shows you the structure of your dataset. | str(df) |
| 5 | data.frame() | Construct a data frame. | data.frame(vector1, vector2, vector3) |
| 5 | subset() | Take a subset of your data frame. | subset(planets_df, subset = rings) |
| 5 | order() | Gives you the ranked position of each element when applied to a vector, such as a single column of a data frame. | order(some_vector) |
| 6 | list() | Construct a list. | my_list <- list(comp1, comp2, comp3) |