This summary Data Engineering contains the course material with extra notes in grey and is made in the year including my answers for the example exam and example questions during the course. Also contains questions of exam itself. This document is very handy to learn in a structured way (highly st...

Data Engineering 2019-2020
Content table – Data Engineering 2019-2020

Course 1 ......................................................................................................................................................... 4
1.1 Intro ............................................................................................................................................................... 4
1.1.A defining data engineering....................................................................................................................... 4
1.1.B Course topics .......................................................................................................................................... 5
1.1.C Class format, lab sessions, exam and project ......................................................................................... 6
1.2 Basic computer architecture and operating systems .................................................................................... 7
1.2.A Basic Computer Architecture ................................................................................................................. 7
1.2.B Operating System (OS) level ................................................................................................................. 10
1.3 File formats.................................................................................................................................................. 14
1.3.A human readable file formats ................................................................................................................ 14
1.3.A.1 CSV..................................................................................................................................................... 14
1.3.A.2 XML.................................................................................................................................................... 15
1.3.A.3 JSON .................................................................................................................................................. 16
1.3.B Not human readable and compressed file formats .............................................................................. 19
1.4 Python concepts .......................................................................................................................................... 21

Course 2 ....................................................................................................................................................... 25
2.1 basic computer architecture and Operating systems (os) ........................................................................... 25
2.2 intro to computer networks......................................................................................................................... 25
2.2.A Important network applications: Web – HTTP ..................................................................................... 27
2.2.B Important network applications: DNS .................................................................................................. 30
2.2.C lab sessions ........................................................................................................................................... 30
2.3 Regular expressions (regex)......................................................................................................................... 31
2.3.A Definition and general application ...................................................................................................... 31
2.3.B Regular expressions in Python .............................................................................................................. 32
2.3.C Gone wrong .......................................................................................................................................... 34
2.3.D Concluding remarks .............................................................................................................................. 34
Summary ........................................................................................................................................................... 34

Course 3 ....................................................................................................................................................... 35
3.1 Basic Linux ................................................................................................................................................... 35
3.1.A linux ...................................................................................................................................................... 36
3.1.B Linux command line instructions (File manipulation) .......................................................................... 38
3.1.C JQ .......................................................................................................................................................... 39
3.2 Cloud Services .............................................................................................................................................. 40
3.2.A Defining cloud services ........................................................................................................................ 40
3.2.B Core AWS services ................................................................................................................................ 41
3.2.C Storage infrastructure .......................................................................................................................... 44
3.2.D Database services ................................................................................................................................. 44
3.2.E Cloud architecture example.................................................................................................................. 45
Summary ........................................................................................................................................................... 45





Course 4 ....................................................................................................................................................... 46
4.1 algorithms and complexity .......................................................................................................................... 46
4.1.A Sorting ................................................................................................................................................. 49
4.2 basic datastructures .................................................................................................................................... 53
4.2.A collections or container ........................................................................................................................ 54
A.1 List ........................................................................................................................................................... 54
A.2 set ............................................................................................................................................................ 55
A.3 map.......................................................................................................................................................... 55
4.2.B trees ...................................................................................................................................................... 55
4.2.C Hash Tables ........................................................................................................................................... 57
Summary ........................................................................................................................................................... 58

Course 5 ....................................................................................................................................................... 59
Databases.......................................................................................................................................................... 59
5.1 Data, data, data ....................................................................................................................................... 59
5.2 evolution of databases ............................................................................................................................ 59
5.3 relational databases................................................................................................................................. 60
5.4 types of databases ................................................................................................................................... 63
5.4.A type 1: production database ................................................................................................................ 63
5.4.B type 2: analytical database ................................................................................................................... 63
5.5 NoSQL Data Stores ................................................................................................................................... 64
5.6 Big Data.................................................................................................................................................... 64

Course 6&7 .................................................................................................................................................. 65
6. Parallel and distributed computing ............................................................................................................... 65
6.1 Parallel computing ................................................................................................................................... 65
6.1.A communication patterns ...................................................................................................................... 66
6.1.B Examples ............................................................................................................................................... 68
6.1.C Analysis of speedup .............................................................................................................................. 70
6.1.D Dependencies ....................................................................................................................................... 70
6.2 Distributed computing ............................................................................................................................. 71
6.3 Use cases ................................................................................................................................................. 73
7. Map reduce ................................................................................................................................................... 74
7.1 map reduce .............................................................................................................................................. 75
7.2 Map-Reduce example .............................................................................................................................. 76
7.3 SQL operations......................................................................................................................................... 77
7.4 Hadoop .................................................................................................................................................... 78
7.5 Shuffling ................................................................................................................................................... 79
7.6 matrix operations .................................................................................................................................... 79
7.7 summary .................................................................................................................................................. 80
7.8 Spark ........................................................................................................................................................ 81
7.9 the debit example on spark ..................................................................................................................... 82
7.10 indexing web pages using spark ............................................................................................................ 83
7.11 Spark functions ...................................................................................................................................... 83
7.11 use cases ................................................................................................................................................ 85

Course 8 & 9: Gdelt project .......................................................................................................................... 85





Course 10 ..................................................................................................................................................... 86
10. Web api’s ..................................................................................................................................................... 86
10.1 Rest api .................................................................................................................................................. 87
10.2 Designing a REST API.............................................................................................................................. 88
10.3 demo ...................................................................................................................................................... 89
10.4 api access ............................................................................................................................................... 90
10.5 Microservices ......................................................................................................................................... 91
10.6 summary ................................................................................................................................................ 92

Course 11: closing remarks ........................................................................................................................... 93
11.1 Choose your technology stack ................................................................................................................... 93
11.2 Streaming .................................................................................................................................................. 94
11.3 Sampling .................................................................................................................................................... 94
11.4 filtering ...................................................................................................................................................... 95
11.5 Streaming technology ............................................................................................................................... 95
11.6 data warehouses ....................................................................................................................................... 96
11.7 Unstructured data ..................................................................................................................................... 98
11.8 Web API’s .................................................................................................................................................. 98

Example Exam .............................................................................................................................................. 99

Quick review of course 1-10 ....................................................................................................................... 109

Gdelt project .............................................................................................................................................. 138





COURSE 1

1.1 INTRO

1.1.A DEFINING DATA ENGINEERING
Defining a data engineer by differentiating it from a data scientist
A data scientist’s principal role is to find value or discover new
opportunities in the company’s data, or to fulfill business needs using
that data. The data scientist/analyst uses the company’s tools and
infrastructure together with his/her knowledge of basic
mathematics, machine learning and statistics.

The role of the data engineer is to provide the data scientist with
the software infrastructure for fetching and processing the data so
that the data scientist can easily explore and gain insight in the
data. He/she is responsible for deploying new models and applications,
typically making use of a workflow management platform.

Extract/Transform/Load (ETL)
Besides supporting data science, the data engineer is more
generally responsible for the processing of data.

The data engineer is responsible for implementing the interfaces that are
necessary for managing the data flow and keeping the data available for
analysis.

The data architect is usually the person responsible for the design of the
whole system.

Typically there are many different data sources within the company. To
enable data scientists to gain insight in that data and generate value, all
that data should be accessible in a central repository in some uniform
format.

[Figure: Extract/Transform/Load (ETL). Data is extracted from several data sources, transformed, and loaded into a data warehouse.]
The data pipeline
The set of processes to automatically extract data from different sources, transform it into some uniform format and store
it in a central place defines the data pipeline
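
As a minimal sketch of the extract, transform and load steps described above, the Python example below reads the same kind of record from two differently formatted sources and stores it in one uniform table. The source names (`CRM_CSV`, `WEBSHOP_JSON`), the field names and the `payments` table are all hypothetical, and an in-memory SQLite database only stands in for a real central repository.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical sources holding the same kind of record in different formats.
CRM_CSV = "customer_id,amount\n1,10.50\n2,7.25\n"
WEBSHOP_JSON = '[{"id": 3, "total": "4.00"}, {"id": 1, "total": "2.50"}]'


def extract_csv(text):
    """Extract: read rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Extract: read records from a JSON source."""
    return json.loads(text)


def transform(crm_rows, webshop_rows):
    """Transform: map both sources onto one (source, customer_id, amount) format."""
    uniform = [("crm", int(r["customer_id"]), float(r["amount"])) for r in crm_rows]
    uniform += [("webshop", int(r["id"]), float(r["total"])) for r in webshop_rows]
    return uniform


def load(rows, connection):
    """Load: store the uniform rows in the central repository."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(source TEXT, customer_id INTEGER, amount REAL)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    connection.commit()


def run_pipeline(connection):
    """Run extract, transform and load once and return the number of rows loaded."""
    rows = transform(extract_csv(CRM_CSV), extract_json(WEBSHOP_JSON))
    load(rows, connection)
    return len(rows)


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
loaded = run_pipeline(warehouse)
totals = warehouse.execute(
    "SELECT customer_id, SUM(amount) FROM payments "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(loaded, totals)
```

Once the data sits in one uniform table, a question that spans both sources (here, the total amount per customer) becomes a single query.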

The data pipeline can also contain production models made by data scientists. Depending on the requirements, these
models have to run in real time, once per hour/day...
Data engineers need to maintain this data flow and ensure its availability and quality:
● make changes if data is added/removed
● solve bottlenecks in the pipeline
● monitor, log and solve errors
● handle duplicate, incorrect or corrupted data
● scale
● test
● ...
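
Two of the tasks listed above, handling duplicate, incorrect or corrupted data and logging errors, can be illustrated with a small cleaning step. This is only a sketch under assumptions: the record layout (`id`, `amount`) and the rule that amounts cannot be negative are invented for the example, not taken from the course.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline")


def clean(records):
    """Drop duplicate, incorrect or corrupted records and log every rejection."""
    seen = set()
    valid, rejected = [], []
    for record in records:
        try:
            key = int(record["id"])
            amount = float(record["amount"])
        except (KeyError, TypeError, ValueError):
            # Missing fields or values that cannot be parsed count as corrupted.
            log.warning("corrupted record skipped: %r", record)
            rejected.append(record)
            continue
        if amount < 0:  # assumed business rule: amounts cannot be negative
            log.warning("incorrect record skipped: %r", record)
            rejected.append(record)
        elif key in seen:
            log.warning("duplicate record skipped: %r", record)
            rejected.append(record)
        else:
            seen.add(key)
            valid.append({"id": key, "amount": amount})
    return valid, rejected


raw = [
    {"id": "1", "amount": "10.5"},
    {"id": "1", "amount": "10.5"},  # duplicate
    {"id": "2", "amount": "-3"},    # incorrect
    {"id": "3", "amount": "n/a"},   # corrupted
    {"id": "4", "amount": "8"},
]
valid, rejected = clean(raw)
print(valid, len(rejected))
```

Returning the rejected records instead of silently dropping them keeps them available for monitoring and later repair.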

Workflow Management Platform
The image shows how this data flow is managed.
The work is split up into parts,
and each part is a step, but you
don’t implement every step yourself
(you don’t have to reinvent the
wheel every time).




DAG configuration and monitoring @PrediCube
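
The idea of splitting the work into steps that form a DAG (directed acyclic graph) can be sketched with Python's standard `graphlib` module. This is not the API of any particular workflow management platform. The step names and the shared `state` dictionary are hypothetical, and a real platform would add scheduling, retries and monitoring on top of the dependency ordering shown here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Each step is a small function working on a shared state dictionary.
def extract_orders(state):
    state["orders"] = [5, 3]


def extract_refunds(state):
    state["refunds"] = [1]


def transform(state):
    state["net"] = sum(state["orders"]) - sum(state["refunds"])


def load(state):
    state["warehouse"] = {"net_revenue": state["net"]}


STEPS = {
    "extract_orders": extract_orders,
    "extract_refunds": extract_refunds,
    "transform": transform,
    "load": load,
}

# The DAG: every step maps to the set of steps that must finish before it.
DAG = {
    "extract_orders": set(),
    "extract_refunds": set(),
    "transform": {"extract_orders", "extract_refunds"},
    "load": {"transform"},
}


def run(dag, steps):
    """Execute every step after all of its dependencies have run."""
    state, executed = {}, []
    for name in TopologicalSorter(dag).static_order():
        steps[name](state)
        executed.append(name)
    return state, executed


state, executed = run(DAG, STEPS)
print(executed, state["warehouse"])
```

The two extract steps do not depend on each other, so their relative order is not fixed. Only the constraint that both run before `transform`, and `transform` before `load`, is guaranteed.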

Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller julievantroyen. Stuvia facilitates payment to the seller.
