Skills: Data Preparation and Workflow Management (328059M3)
Summary Data Preparation and Workflow Management (dPrep) 2022/2023 - All Lectures, Readings, Tutorials
Institution
Tilburg University (UVT)
Summary of all the readings, lectures, and tutorials (incl. answers) for the course Data Preparation and Workflow Management (dPrep). It is written in clear, concise language rather than as terse bullet lists that leave you to figure everything out yourself.
This file is a must-have for the open-book exam of t...
Table of Contents
WEEK 0: PREPARATION BEFORE THE COURSE STARTS (22/08-28/08)
DataCamp – Introduction to R
Chapter 1: Intro to Basics
Chapter 2: Vectors
Chapter 3: Matrices
Chapter 4: Factors
Chapter 5: Data frames
Chapter 6: Lists
Table: All Functions from the Chapters
WEEK 1 – LECTURE: GETTING STARTED WITH R (01/09)
Tilburg Science Hub. Professionalize your Team Work Using Scrum / A Guide to Scrum for Researchers
How we professionalized our teamwork
Scrum in a nutshell
Why Scrum is useful and how to make it a success
In-Class Tutorial: Getting to Know R
After-Class Tutorial: R for Social Scientists
Chapter 1: Before we Start
Chapter 2: Introduction to R
Chapter 3: Starting with Data
Chapter 4: Data Wrangling with dplyr and tidyr
WEEK 2 – TUTORIAL: PROJECT MANAGEMENT AND VERSION CONTROL (08/09)
The GitHub Training Team. Introduction to GitHub
DataCamp – Introduction to Shell
Chapter 1: Manipulating files and directories
Tilburg Science Hub. Principles of Project Setup and Workflow Management
Project Setup Overview
Pipelines and Project Components
Data Management and Directory Structure
Automating your Pipeline
Documenting Datasets
Documenting Source Code and Pipeline Workflows
Versioning using Git and GitHub
Collaborating using GitHub
Checklist to Audit Data- and Computation-intensive Projects
In-Class Tutorial: GitHub
After-Class Tutorial: Version Control
1. Getting started with version control using Git and GitHub
2. The end-to-end Git workflow
3. Advanced Git Workflows
WEEK 3 – TUTORIAL: DATA EXPLORATION USING RMARKDOWN (15/09)
DataCamp – Intermediate R
Chapter 1: Conditionals and Control Flow
Chapter 2: Loops
Chapter 3: Functions
Table: All Functions from the Chapters
R for Social Scientists
Chapter 6: Getting Started with R Markdown
In-Class Tutorial: Exploring and Auditing New Data with RMarkdown
After-Class Tutorial: Data Exploration in R
WEEK 4 – TUTORIAL: ENGINEERING DATASETS (22/09)
DataCamp – Introduction to the Tidyverse
Chapter 1: Data Wrangling
Chapter 3: Grouping and summarizing
Table: All Functions from the Chapters
DataCamp – Cleaning Data in R
Chapter 1: Common Data Problems
Chapter 2: Categorical and Text Data
Table: All Functions from the Chapters
DataCamp – Joining Data with dplyr
Chapter 1: Joining Tables
Chapter 2: Left and Right Joins
Table: All Functions from the Chapters
Opidi, A. (2019, September 19). Solving Data Challenges in Machine Learning With Automated Tools. TOPBOTS
Introduction
The data preparation process
1. Data collection
2. Data preprocessing
3. Data transformation
Data preparation challenges
Solutions to accelerate data preparation
In-Class Tutorial: Engineering Data Sets
WEEK 5 – TUTORIAL: PIPELINE AUTOMATION (28/09)
Video: This is why you should automate the pipeline of your research project. (4:54)
Video: Four steps to automate & make reproducible your empirical research project. (12:27)
Tilburg Science Hub. Use Makefiles to Re-Run Your Code
Overview
Code
Advanced Use Cases
In-Class Tutorial: Pipeline Building and Automation
After-Class Exercises: Pipeline Building and Automation
WEEK 0: PREPARATION BEFORE THE COURSE STARTS (22/08-28/08)
Literature
DataCamp – Introduction to R:
• Chapter 1: Intro to Basics
• Chapter 2: Vectors
• Chapter 3: Matrices
• Chapter 4: Factors
• Chapter 5: Data frames
• Chapter 6: Lists
DataCamp – Introduction to R
Chapter 1: Intro to Basics
R works with numerous data types:
• Decimal values like 4.5 are called numerics.
• Whole numbers like 4 are called integers. Integers are also numerics. (A bare 4 is stored as a numeric; write 4L to store it as an integer.)
• Boolean values (TRUE or FALSE) are called logical.
• Text (or string) values are called characters.
You can check the data type of a variable beforehand by using the class() function.
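As a minimal sketch of the four data types above, the following checks each one with class(). The variable names and values are hypothetical and chosen only for illustration.

```r
# Hypothetical example values, one per basic data type
my_numeric   <- 4.5     # decimal value
my_integer   <- 4L      # the L suffix stores the value as an integer
my_logical   <- TRUE    # Boolean value
my_character <- "text"  # text (string) value

class(my_numeric)    # "numeric"
class(my_integer)    # "integer"
class(my_logical)    # "logical"
class(my_character)  # "character"
```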
Chapter 2: Vectors
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other
words, a vector is a simple tool to store data. In R, you create a vector with the combine function c().
You place the vector elements separated by a comma between the parentheses.
You can give a name to the elements of a vector with the names() function.
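Creating and naming a vector could look roughly like this. The winnings and the weekday labels are illustrative values, not data from the course.

```r
# Hypothetical poker winnings for five weekdays
poker_vector <- c(140, -50, 20, -120, 240)

# Attach a name to every element
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

poker_vector["Wednesday"]  # 20, selected by name
sum(poker_vector)          # 230
```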
The (logical) comparison operators known to R are:
• < for less than.
• > for greater than.
• <= for less than or equal to.
• >= for greater than or equal to.
• == for equal to each other.
• != for not equal to each other.
To select elements of a vector, you can use square brackets. For example, to select the first element of
the vector, you add [1] to the vector name.
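A short sketch of selection and comparison on a vector, using made-up values. Note that R counts positions from 1, and that a comparison returns a logical vector that can itself be used for selection.

```r
# Hypothetical values for illustration
poker_vector <- c(140, -50, 20, -120, 240)

poker_vector[1]         # 140, the first element
poker_vector[c(2, 4)]   # -50 -120, several elements at once

# A comparison is applied element by element
positive <- poker_vector > 0    # TRUE FALSE TRUE FALSE TRUE
poker_vector[positive]          # 140 20 240

poker_vector != 20              # TRUE TRUE FALSE TRUE TRUE
```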
Chapter 3: Matrices
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged
into a fixed number of rows and columns.
Similar to vectors, you can add names for the rows and the columns of a matrix, using rownames() and
colnames(), respectively.
Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix.
Some examples:
• my_matrix[1,2] selects the element at the first row and second column.
• my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.
• my_matrix[,1] selects all elements of the first column.
• my_matrix[1,] selects all elements of the first row.
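The selections above can be tried on a small example matrix. The 3-by-4 matrix and its row and column names are assumptions made for this sketch.

```r
# Hypothetical 3 x 4 matrix, filled row by row with 1..12
my_matrix <- matrix(1:12, byrow = TRUE, nrow = 3)
rownames(my_matrix) <- c("r1", "r2", "r3")
colnames(my_matrix) <- c("c1", "c2", "c3", "c4")

my_matrix[1, 2]       # 2, first row and second column
my_matrix[1:3, 2:4]   # 3 x 3 matrix: rows 1-3, columns 2-4
my_matrix[, 1]        # 1 5 9, the whole first column
my_matrix[1, ]        # 1 2 3 4, the whole first row
```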
Chapter 4: Factors
The term factor refers to a statistical data type used to store categorical variables. The difference
between a categorical variable and a continuous variable is that a categorical variable can belong to a
limited number of categories. A continuous variable, on the other hand, can correspond to an infinite
number of values.
There are two types of categorical variables: a nominal categorical variable and an ordinal categorical
variable. A nominal variable is a categorical variable without an implied order. This means that it is
impossible to say that ‘one is worth more than the other’. In contrast, ordinal variables do have a natural
ordering.
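The difference between nominal and ordinal variables can be sketched with factor(). The category labels below are hypothetical; the point is that only the ordered factor supports comparisons such as "greater than".

```r
# Nominal: categories without an implied order (hypothetical labels)
animals <- factor(c("Dog", "Cat", "Dog", "Horse"))
levels(animals)      # the distinct categories
is.ordered(animals)  # FALSE

# Ordinal: state the order of the levels explicitly
temperature <- factor(c("High", "Low", "Medium", "Low"),
                      ordered = TRUE,
                      levels = c("Low", "Medium", "High"))
is.ordered(temperature)          # TRUE
temperature[1] > temperature[2]  # TRUE, because High ranks above Low
```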
Chapter 5: Data frames
A data frame has the variables of a dataset as columns and the observations as rows. You construct a
data frame with the data.frame() function.
Similar to vectors and matrices, you select elements from a data frame with the help of square brackets
[ ].
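A small sketch of building a data frame and selecting from it. The planet values are illustrative only; each vector becomes a column (a variable) and each position becomes a row (an observation).

```r
# Hypothetical vectors, one per variable
name     <- c("Mercury", "Venus", "Earth", "Saturn")
diameter <- c(0.382, 0.949, 1.000, 9.449)
rings    <- c(FALSE, FALSE, FALSE, TRUE)

planets_df <- data.frame(name, diameter, rings)

planets_df[1, 2]           # 0.382, first row and second column
planets_df[, "diameter"]   # a whole column, selected by name
planets_df$rings           # the same idea with the $ shortcut

# Keep only the rows where rings is TRUE
ringed <- planets_df[planets_df$rings, ]
```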
Chapter 6: Lists
A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an
ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even
required that these objects are related to each other in any way.
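As a rough illustration, the list below gathers three unrelated objects under one name. All object and component names are hypothetical.

```r
# Three unrelated objects of different kinds
my_vector <- 1:3
my_matrix <- matrix(1:4, nrow = 2)
my_df     <- data.frame(x = c("a", "b"), y = c(TRUE, FALSE))

# Gather them in one list, giving each component a name
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)

length(my_list)     # 3 components
my_list[["vec"]]    # 1 2 3, double brackets return the component itself
my_list$mat[2, 1]   # 2, select inside a component
```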
Table: All Functions from the Chapters
Ch. | Function | Description | Example
1 | class() | Check the data type of a variable. | class(my_integer)
2 | c() | Create a vector. | poker_vector <- c(140, -50, 20, -120, 240)
2 | names() | Assign a name to the elements of a vector. | some_vector <- c("John Doe", "poker player"); names(some_vector) <- c("Name", "Profession")
2 | sum() | Calculates the sum of all elements of a vector. | total_poker <- sum(poker_vector)
2 | mean() | Calculates the average of the values. | mean(poker_start)
3 | matrix() | Construct a matrix. | matrix(1:9, byrow = TRUE, nrow = 3)
3 | rownames() | Name rows of a matrix. | rownames(my_matrix) <- row_names_vector
3 | colnames() | Name columns of a matrix. | colnames(my_matrix) <- col_names_vector
3 | rowSums() | Calculates the total for each row of a matrix and returns the totals as a new vector. | rowSums(my_matrix)
3 | colSums() | Calculates the total for each column of a matrix and returns the totals as a new vector. | colSums(my_matrix)
3 | cbind() | Merge matrices and/or vectors together by column. | big_matrix <- cbind(matrix1, matrix2, vector1)
3 | rbind() | Merge matrices and/or vectors together by row. | big_matrix <- rbind(matrix1, matrix2, vector1)
3 | ls() | Check out the contents of the workspace. | ls()
4 | factor() | Used to encode a vector as a factor. | factor(some_vector, ordered = TRUE, levels = c("lev1", "lev2", "lev3"))
4 | levels() | Change the names of factor levels. | levels(factor_vector) <- c("name1", "name2")
4 | summary() | Gives you a quick overview of the contents of a variable. | summary(my_var)
5 | head() | Shows the first observations of a data frame. | head(df)
5 | tail() | Shows the last observations in your dataset. | tail(df)
5 | str() | Shows you the structure of your dataset. | str(df)
5 | data.frame() | Construct a data frame. | data.frame(vector1, vector2, vector3)
5 | subset() | Take a subset of your data frame. | subset(planets_df, subset = rings)
5 | order() | Gives you the ranked position of each element of a vector, such as one column of a data frame. | order(df$some_column)
6 | list() | Construct a list. | my_list <- list(comp1, comp2, comp3)