Summary Data Engineering
This Data Engineering summary contains the course material with extra notes in grey. It was made in the year of the course and includes my answers to the example exam and to the example questions asked during the course. It also contains questions from the exam itself. This document is very handy to learn in a structured way (highly st...
Published: 21 May 2020
File updated: 15 June 2020
Number of pages: 190
Written in: 2019/2020
Type: Summary
By: arnaudalloin • 8 months ago
By: jeroenvandekerckhove • 4 years ago
Data Engineering 2019-2020
Table of contents – Data Engineering 2019-2020
Course 1
1.1 Intro
1.1.A Defining data engineering
1.1.B Course topics
1.1.C Class format, lab sessions, exam and project
1.2 Basic computer architecture and operating systems
1.2.A Basic computer architecture
1.2.B Operating system (OS) level
1.3 File formats
1.3.A Human readable file formats
1.3.A.1 CSV
1.3.A.2 XML
1.3.A.3 JSON
1.3.B Not human readable and compressed file formats
1.4 Python concepts
Course 2
2.1 Basic computer architecture and operating systems (OS)
2.2 Intro to computer networks
2.2.A Important network applications: Web – HTTP
2.2.B Important network applications: DNS
2.2.C Lab sessions
2.3 Regular expressions (regex)
2.3.A Definition and general application
2.3.B Regular expressions in Python
2.3.C Gone wrong
2.3.D Concluding remarks
Summary
Course 3
3.1 Basic Linux
3.1.A Linux
3.1.B Linux command line instructions (file manipulation)
3.1.C JQ
3.2 Cloud services
3.2.A Defining cloud services
3.2.B Core AWS services
3.2.C Storage infrastructure
3.2.D Database services
3.2.E Cloud architecture example
Summary
Course 4
4.1 Algorithms and complexity
4.1.A Sorting
4.2 Basic data structures
4.2.A Collections or container
A.1 List
A.2 Set
A.3 Map
4.2.B Trees
4.2.C Hash tables
Summary
Course 5
Databases
5.1 Data, data, data
5.2 Evolution of databases
5.3 Relational databases
5.4 Types of databases
5.4.A Type 1: production database
5.4.B Type 2: analytical database
5.5 NoSQL data stores
5.6 Big data
Course 6 & 7
6. Parallel and distributed computing
6.1 Parallel computing
6.1.A Communication patterns
6.1.B Examples
6.1.C Analysis of speedup
6.1.D Dependencies
6.2 Distributed computing
6.3 Use cases
7. Map reduce
7.1 Map reduce
7.2 Map-reduce example
7.3 SQL operations
7.4 Hadoop
7.5 Shuffling
7.6 Matrix operations
7.7 Summary
7.8 Spark
7.9 The debit example on Spark
7.10 Indexing web pages using Spark
7.11 Spark functions
7.11 Use cases
Course 8 & 9: Gdelt project
Course 10
10. Web APIs
10.1 REST API
10.2 Designing a REST API
10.3 Demo
10.4 API access
10.5 Microservices
10.6 Summary
Course 11: closing remarks
11.1 Choose your technology stack
11.2 Streaming
11.3 Sampling
11.4 Filtering
11.5 Streaming technology
11.6 Data warehouses
11.7 Unstructured data
11.8 Web APIs
Example exam
Quick review of course 1-10
Gdelt project
COURSE 1
1.1 INTRO
1.1.A DEFINING DATA ENGINEERING
Defining a data engineer by differentiating it from a data scientist
A data scientist’s principal role is to find value or discover new
opportunities in the company’s data, or to fulfill business needs using
that data. The data scientist/analyst uses the company’s tools and
infrastructure together with his/her knowledge of basic
mathematics, machine learning and statistics.
The role of the data engineer is to provide the data scientist with
the software infrastructure for fetching and processing the data so
that the data scientist can easily explore and gain insight into the
data. He/she is also responsible for deploying new models and
applications, typically making use of a workflow management platform.
Extract/Transform/Load (ETL)
Besides supporting data science, the data engineer is more
generally responsible for the processing of data.
The data engineer is responsible for implementing the interfaces that are
necessary for managing the data flow and keeping the data available for
analysis.
The data architect is usually the person responsible for the design of the
whole system.
Typically there are many different data sources within the company. To
enable data scientists to gain insight into that data and generate value, all
that data should be accessible in a central repository in some uniform
format.
[Figure: ETL diagram. Several data sources are extracted, transformed and loaded into a data warehouse.]
The data pipeline
The set of processes that automatically extract data from different sources, transform it into some uniform format and store
it in a central place defines the data pipeline.
The data pipeline can also contain production models made by data scientists. Depending on the requirements, these
models have to run in real time, once per hour/day...
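The extract, transform and load steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not the course's implementation: the two in-memory sources, the field names (`id`, `name`, `amount`) and the list that plays the role of the central repository are all assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical in-memory "sources"; in practice these would be files, databases or APIs.
CSV_SOURCE = "id,name,amount\n1,Alice,10.5\n2,Bob,7\n"
JSON_SOURCE = '[{"id": 3, "name": "Carol", "amount": "3.25"}]'


def extract_csv(text):
    """Extract: read rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Extract: read rows from a JSON source as dictionaries."""
    return json.loads(text)


def transform(row):
    """Transform: bring a row from any source into one uniform format."""
    return {
        "id": int(row["id"]),
        "name": str(row["name"]),
        "amount": float(row["amount"]),
    }


def load(rows, warehouse):
    """Load: store the uniform rows in the central place (here just a list)."""
    warehouse.extend(rows)


def run_pipeline():
    """Run extract -> transform -> load for every source and return the result."""
    warehouse = []
    for rows in (extract_csv(CSV_SOURCE), extract_json(JSON_SOURCE)):
        load([transform(r) for r in rows], warehouse)
    return warehouse


if __name__ == "__main__":
    for record in run_pipeline():
        print(record)
```

Keeping each step in its own function makes it easier to add a new source later: only a new extract function is needed, while transform and load stay the same.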
Data engineers need to maintain this data flow and ensure its availability and quality:
● make changes if data is added/removed
● solve bottlenecks in the pipeline
● monitor, log and solve errors
● handle duplicate, incorrect or corrupted data
● scale
● test
● ...
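Two of the tasks listed above, handling duplicate, incorrect or corrupted data and logging errors, could look roughly like this in Python. The record layout (`id`, `amount`) and the rule that a negative amount counts as incorrect are hypothetical choices for this sketch, not rules from the course.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def clean(records):
    """Drop duplicate, incorrect and corrupted records, logging every rejection."""
    seen_ids = set()
    valid = []
    for rec in records:
        try:
            rec_id = int(rec["id"])
            amount = float(rec["amount"])
        except (KeyError, TypeError, ValueError) as exc:
            # Corrupted: a field is missing or cannot be parsed.
            log.warning("corrupted record %r: %s", rec, exc)
            continue
        if amount < 0:
            # Incorrect: parses fine but violates the (assumed) business rule.
            log.warning("incorrect record %r: negative amount", rec)
            continue
        if rec_id in seen_ids:
            # Duplicate: this id was already accepted earlier.
            log.info("duplicate record skipped: id=%s", rec_id)
            continue
        seen_ids.add(rec_id)
        valid.append({"id": rec_id, "amount": amount})
    return valid


if __name__ == "__main__":
    raw = [
        {"id": "1", "amount": "10"},
        {"id": "1", "amount": "10"},    # duplicate
        {"id": "2", "amount": "oops"},  # corrupted
        {"id": "3", "amount": "-5"},    # incorrect
    ]
    print(clean(raw))
```

Logging each rejected record instead of silently dropping it supports the monitoring task from the same list: a sudden rise in warnings points to a problem in one of the sources.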
Workflow Management Platform
The image shows how we manage this data.
We split the work on the data up into parts, and each part is a step,
but you don’t do every step yourself (you don’t have to reinvent the
wheel every time).
DAG configuration and monitoring @PrediCube
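The idea of a workflow as a DAG (directed acyclic graph) of steps can be sketched with Python's standard library alone. This is not how any particular workflow management platform is configured; the step names and the `run_workflow` helper are hypothetical, and a real platform adds scheduling, retries and monitoring on top of the ordering shown here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps: each key maps to the set of steps it depends on.
DAG = {
    "extract_sales": set(),
    "extract_web_logs": set(),
    "transform": {"extract_sales", "extract_web_logs"},
    "load_warehouse": {"transform"},
    "run_model": {"load_warehouse"},
}


def run_workflow(dag, tasks):
    """Run every step after all of its dependencies; return the execution order."""
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        tasks[name]()
    return order


if __name__ == "__main__":
    # Each task only prints its name here; real tasks would do the actual work.
    tasks = {name: (lambda n=name: print("running", n)) for name in DAG}
    run_workflow(DAG, tasks)
```

Because the dependencies are declared as data, a step can be added or reordered without touching the code that runs the workflow.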