
Summary Data Engineering


This Data Engineering summary contains the course material with extra notes in grey. It was made during the year and includes my answers to the example exam and to the example questions asked during the course. It also contains questions from the exam itself. This document is very handy to learn in a structured way (highly st...

Last update of the document: 4 years ago

Preview: 4 of 190 pages

  • 21 May 2020
  • 15 June 2020
  • 190
  • 2019/2020
  • Summary
julievantroyen
Data Engineering 2019-2020
Table of contents

Course 1
1.1 Intro
1.1.A Defining data engineering
1.1.B Course topics
1.1.C Class format, lab sessions, exam and project
1.2 Basic computer architecture and operating systems
1.2.A Basic computer architecture
1.2.B Operating system (OS) level
1.3 File formats
1.3.A Human-readable file formats
1.3.A.1 CSV
1.3.A.2 XML
1.3.A.3 JSON
1.3.B Not human-readable and compressed file formats
1.4 Python concepts

Course 2
2.1 Basic computer architecture and operating systems (OS)
2.2 Intro to computer networks
2.2.A Important network applications: Web (HTTP)
2.2.B Important network applications: DNS
2.2.C Lab sessions
2.3 Regular expressions (regex)
2.3.A Definition and general application
2.3.B Regular expressions in Python
2.3.C Gone wrong
2.3.D Concluding remarks
Summary

Course 3
3.1 Basic Linux
3.1.A Linux
3.1.B Linux command line instructions (file manipulation)
3.1.C JQ
3.2 Cloud services
3.2.A Defining cloud services
3.2.B Core AWS services
3.2.C Storage infrastructure
3.2.D Database services
3.2.E Cloud architecture example
Summary

Course 4
4.1 Algorithms and complexity
4.1.A Sorting
4.2 Basic data structures
4.2.A Collections or containers
A.1 List
A.2 Set
A.3 Map
4.2.B Trees
4.2.C Hash tables
Summary

Course 5
Databases
5.1 Data, data, data
5.2 Evolution of databases
5.3 Relational databases
5.4 Types of databases
5.4.A Type 1: production database
5.4.B Type 2: analytical database
5.5 NoSQL data stores
5.6 Big data

Course 6&7
6. Parallel and distributed computing
6.1 Parallel computing
6.1.A Communication patterns
6.1.B Examples
6.1.C Analysis of speedup
6.1.D Dependencies
6.2 Distributed computing
6.3 Use cases
7. Map reduce
7.1 Map reduce
7.2 Map-Reduce example
7.3 SQL operations
7.4 Hadoop
7.5 Shuffling
7.6 Matrix operations
7.7 Summary
7.8 Spark
7.9 The debit example on Spark
7.10 Indexing web pages using Spark
7.11 Spark functions
7.11 Use cases

Course 8 & 9: GDELT project

Course 10
10. Web APIs
10.1 REST API
10.2 Designing a REST API
10.3 Demo
10.4 API access
10.5 Microservices
10.6 Summary

Course 11: closing remarks
11.1 Choose your technology stack
11.2 Streaming
11.3 Sampling
11.4 Filtering
11.5 Streaming technology
11.6 Data warehouses
11.7 Unstructured data
11.8 Web APIs

Example exam

Quick review of course 1-10

GDELT project

COURSE 1

1.1 INTRO

1.1.A DEFINING DATA ENGINEERING
Defining a data engineer by differentiating it from a data scientist
A data scientist's principal role is to find value or discover new
opportunities in the company's data, or to fulfill business needs using
that data. The data scientist/analyst uses the company's tools and
infrastructure together with his/her knowledge of basic
mathematics, machine learning and statistics.

The role of the data engineer is to provide the data scientist with
the software infrastructure for fetching and processing the data, so
that the data scientist can easily explore and gain insight into the
data. He/she is also responsible for deploying new models and applications,
typically making use of a workflow management platform.

Extract/Transform/Load (ETL)
Besides supporting data science, the data engineer is more
generally responsible for the processing of data.

The data engineer is responsible for implementing the interfaces that are
necessary for managing the data flow and keeping the data available for
analysis.

The data architect is usually the person responsible for the design of the
whole system.

Typically there are many different data sources within the company. To
enable data scientists to gain insight into that data and generate value, all
that data should be accessible in a central repository in some uniform
format.

[Figure: several data sources feed an extract, transform and load flow into a data warehouse]
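
The extract, transform and load steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's implementation: the two in-memory sources, their field names, the `payments` table and the use of an in-memory SQLite database as a stand-in for the data warehouse are all assumptions made for the example.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical data sources with different formats and field names.
CSV_SOURCE = "customer_id,amount\n1,10.50\n2,3.20\n"
JSON_SOURCE = '[{"customer": 3, "total": "7.00"}]'


def extract():
    """Read the raw records from every source."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Convert both sources to one uniform format: (customer_id, amount)."""
    uniform = [(int(r["customer_id"]), float(r["amount"])) for r in csv_rows]
    uniform += [(int(r["customer"]), float(r["total"])) for r in json_rows]
    return uniform


def load(rows, connection):
    """Store the uniform records in the central repository."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments (customer_id INTEGER, amount REAL)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    connection.commit()


def run_etl(connection):
    """Run the three steps in order."""
    load(transform(*extract()), connection)


warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
run_etl(warehouse)
print(warehouse.execute("SELECT customer_id, amount FROM payments").fetchall())
```

Once all sources are loaded in the same format, a data scientist can query one table instead of dealing with each source separately.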
The data pipeline
The data pipeline is the set of processes that automatically extract data from different sources, transform it into some
uniform format and store it in a central place.

The data pipeline can also contain production models made by data scientists. Depending on the requirements, these
models have to run in real time, once per hour/day, ...
Data engineers need to maintain this data flow and ensure its availability and quality:
● make changes if data is added/removed
● solve bottlenecks in the pipeline
● monitor, log and solve errors
● handle duplicate, incorrect or corrupted data
● scale
● test
● ...
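
Two of the maintenance tasks listed above (logging errors and handling duplicate, incorrect or corrupted data) could look roughly like the following pipeline step. This is a minimal sketch under assumptions: the record layout (`id`, `amount`), the rule that a negative amount is "incorrect" and the logger name are all hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")  # hypothetical logger name


def clean(records):
    """Drop duplicate and invalid records, logging every rejected one."""
    seen = set()
    valid = []
    for record in records:
        key = record.get("id")
        amount = record.get("amount")
        if key is None or not isinstance(amount, (int, float)):
            # Missing id or non-numeric amount: treat as corrupted.
            log.warning("corrupted record skipped: %r", record)
        elif amount < 0:
            # Assumed business rule for this sketch: amounts are never negative.
            log.warning("incorrect amount skipped: %r", record)
        elif key in seen:
            log.info("duplicate skipped: %r", record)
        else:
            seen.add(key)
            valid.append(record)
    return valid


raw = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},   # duplicate
    {"id": 2, "amount": -5.0},   # incorrect (negative amount)
    {"id": 3, "amount": "n/a"},  # corrupted
    {"id": 4, "amount": 2.5},
]
cleaned = clean(raw)
print(cleaned)
```

Logging each rejected record, instead of dropping it silently, is what makes the errors visible to whoever monitors the pipeline.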

Workflow Management Platform
The image shows how this data flow is managed. The data processing is split into parts, and each part is a step.
You don't build every step yourself (you don't have to reinvent the wheel every time).

[Figure: DAG configuration and monitoring @PrediCube]
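
The idea of splitting the work into steps that form a DAG (directed acyclic graph) can be sketched with Python's standard library. This is an assumption-laden illustration and not the platform shown in the figure: the step names and their dependencies are hypothetical, and `graphlib.TopologicalSorter` (Python 3.9+) is used only to find a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

executed = []  # records the order in which steps actually ran


def make_step(name):
    """Create a placeholder step; a real step would do actual work."""
    def step():
        executed.append(name)
    return step


# Hypothetical step names for this sketch.
steps = {
    name: make_step(name)
    for name in ["extract_sales", "extract_web", "transform", "load", "report"]
}

# Each step maps to the set of steps that must finish before it starts.
dependencies = {
    "transform": {"extract_sales", "extract_web"},
    "load": {"transform"},
    "report": {"load"},
}


def run_workflow(steps, dependencies):
    """Run every step after all the steps it depends on."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        steps[name]()
    return order


order = run_workflow(steps, dependencies)
print(order)
```

A real workflow management platform adds scheduling, retries and monitoring on top of this ordering, which is why you do not have to write those parts yourself.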
