Table of Contents
I. Week 1 ....................................................................................................................... 7
1. Welcome to data engineering ............................................................................................. 7
1.1 Overview: ......................................................................................................................................... 7
1.2 Pop quiz ............................................................................................................................................ 7
1.3 Differentiating between data engineering and data science? ......................................................... 7
1.4 About this course............................................................................................................................ 10
1.5 Exam ............................................................................................................................................... 12
2. File formats .......................................................................................................................12
2.1 Overview: ....................................................................................................................................... 12
2.2 Formats........................................................................................................................................... 13
a) Human readable formats .................................................................................................................... 13
b) Not human readable and compressed file formats ............................................................................ 18
2.5 When to use what? ........................................................................................................................ 20
3 Python concepts ................................................................................................................21
3.1 Overview......................................................................................................................................... 21
3.2 Programming paradigms ................................................................................................................ 21
3.3 Python functions are first-class objects .......................................................................................... 21
3.4 Anonymous lambda function ......................................................................................................... 21
3.5 Passing functions to other functions .............................................................................................. 22
3.6 Sorting elements of an iterable ...................................................................................................... 22
3.6.1 Sorting (Python indexing starts at 0) ............................................................................................. 22
3.7 Partial functions.............................................................................................................................. 23
3.8 Collections ...................................................................................................................................... 23
3.8.1 defaultdict ...................................................................................................................................... 23
3.8.2 Counter ........................................................................................................................................... 24
3.9 Map the elements of an iterable .................................................................................................... 24
3.10 Itertools: zip (iterate over multiple lists at the same time) .................................................... 24
3.11 Itertools .......................................................................................................................................... 24
3.11.1 combinations (finds all possible combinations of the elements in a list,
depending on the parameters you pass) ................................................... 24
3.11.2 permutations.............................................................................................................................. 25
3.12 One-line dot-product ...................................................................................................................... 25
3.13 Unicode .......................................................................................................................................... 25
3.14 Dates and times .............................................................................................................................. 27
II. Week 2 ..................................................................................................................... 28
1. Computer architecture and OS ...........................................................................................28
1.1 Basic computer architecture and operating systems ......................................................28
1.1.1 At the end of this lecture ................................................................................................................ 28
1.1.2 Why do I need to know about computer architecture? ................................................................. 28
1.1.3 The main components of a computer ............................................................................................ 28
1.1.4 The clock frequency (speed of CPU) and the architecture of the CPU influence the number of
instructions that can be executed per second............................................................................................. 29
1.1.4.1 Parallelism will have a bigger impact on your data processing.................................................. 29
1.1.4.2 Modern CPU packages contain multiple cores improving parallelism ....................................... 29
1.1.4.3 The speed gap between CPU and memory ................................................................................ 30
1.1.4.4 Caching to optimize your data flow ........................................................................................... 30
1.1.5 Disks: high capacity, very slow storage devices.............................................................................. 31
1.1.6 Hard Disk Drive (HDD) .................................................................................................................... 31
1.1.6.1 HDD latency................................................................................................................................ 32
1.1.7 Solid State Disks (=transistors) ....................................................................................................... 32
1.1.8 Scaling: vertical vs horizontal ......................................................................................................... 33
1.2 Operating system level ..................................................................................................34
1.3 What is an operating system (OS) .................................................................................................. 34
1.3.1 The operating system hides some of the hardware complexity..................................................... 34
1.4 Process management: .................................................................................................................... 35
1.4.1 Process Control Block (PCB)............................................................................................................ 36
1.4.2 Threads (exist within the same process) and concurrency ........................................................... 36
1.4.3 Scheduling ...................................................................................................................................... 37
1.5 Memory management and virtual memory ................................................................................... 37
1.6 Inter-process communication ........................................................................................................ 39
1.7 Input/Output management ............................................................................................................ 39
1.8 File systems as a way to organize files on (secondary) memory .................................................... 39
1.9 Directory structure ......................................................................................................................... 39
1.10 Distributed File Systems (DFS) ........................................................................................................ 40
1.11 Virtualization: virtual machine ....................................................................................................... 40
1.12 Containers: lightweight virtualization ............................................................................................ 41
2. Regular expressions ...........................................................................................................41
2.1 Overview......................................................................................................................................... 41
2.2 Regular expression = regex ............................................................................................................. 41
2.2.1 Extracting email addresses of all people registered for Data Engineering ..................................... 42
2.2.2 Applications .................................................................................................................................... 43
2.2.3 Regular expressions are like a mini-language where certain characters have a special meaning . 43
2.2.4 Searching for a literal...................................................................................................................... 44
2.2.5 Match any character with . ............................................................................................................. 44
2.2.5.1 Match a set of characters........................................................................................................... 44
2.2.5.2 Match a range of characters ...................................................................................................... 45
2.2.5.3 Negate a set of characters ......................................................................................................... 45
2.2.6 Some predefined set of characters ................................................................................................ 45
2.2.7 Repeat a pattern one or zero times (make it optional) .................................................................. 46
2.2.8 Repeat a pattern one or more times .............................................................................................. 46
2.2.9 Repeat a pattern exactly n times .................................................................................................... 46
2.2.10 Repeating operators are greedy ................................................................................................ 46
2.2.11 Capturing and non-capturing groups ......................................................................................... 46
2.2.12 Lookahead .................................................................................................................................. 47
2.2.13 Lookbehind................................................................................................................................. 47
2.2.14 Regexes gone wrong .................................................................................................................. 48
2.2.15 Put an ad on all URLs that contain the name of the telecom operator
2.2.16 Why [^t] ? It’s easy to miss edge cases, might not be perfect ................................................... 48
2.3 Concluding remarks ........................................................................................................................ 48
2.4 Extra................................................................................................................................................ 49
3. Computer networks ...........................................................................................................50
3.1 Based on Computer Networking: A Top-Down Approach by Kurose and Ross................................. 50
3.1.1 The internet described in terms of its hardware components ....................................................... 50
3.1.2 Hosts and the client-server model ................................................................................................. 50
3.1.3 Protocol .......................................................................................................................................... 51
3.1.4 Packet Switching ............................................................................................................................ 51
3.1.5 The internet protocol stack ........................................................................................................... 52
3.1.6 Layered structure and protocols: the protocol stack ..................................................................... 52
3.2 Network applications...................................................................................................................... 53
3.2.1 HTTP – HyperText Transfer Protocol ................................................................................... 54
3.2.1.1 The HTTP request message ........................................................................................................ 54
3.2.1.2 The HTTP headers: ..................................................................................................................... 55
3.2.1.3 The general structure of the HTTP request message................................................................ 55
3.2.1.4 The HTTP response message ...................................................................................................... 55
3.2.1.5 The general structure of the HTTP response message ............................................................. 56
3.2.1.6 If using non-secure HTTP (not HTTPS), one’s login and password are sent in clear text. ............. 56
3.2.1.7 HTTP uses the TCP transport layer protocol to send its messages ........................................... 56
3.2.1.8 Addressing processes ................................................................................................................. 57
3.2.2 DNS – Domain Name System.......................................................................................................... 57
3.2.2.1 Why distributed? ....................................................................................................................... 58
3.2.2.2 DNS client used by a web browser............................................................................................ 58
III. Week 3: ................................................................................................................. 59
1. Cloud services ....................................................................................................................59
I.1 Overview .......................................................................................................................59
I.2 Defining CS ....................................................................................................................59
I.2.1 What is the cloud? .......................................................................................................................... 59
I.2.2 Managing your cloud account ........................................................................................................ 59
I.2.3 The cloud stack ............................................................................................................................... 59
I.2.4 Examples of SaaS on AWS .............................................................................................................. 59
I.2.5 Advantages of cloud computing ..................................................................................................... 60
I.2.6 Geographic organization of AWS .................................................................................................... 60
I.2.7 Virtualization .................................................................................................................................. 60
I.2.8 Native and hosted virtualization .................................................................................................... 61
I.3 Core AWS services .........................................................................................................61
I.3.1 Virtual server hosting: AWS EC2 ..................................................................................................... 61
I.3.2 EC2 price models ............................................................................................................................ 61
I.3.3 VPC: Virtual Private Cloud............................................................................................................... 63
I.3.4 Identity and Access Management (IAM): Permissions and roles .......................................................... 63
I.3.5 EBS: Elastic Block Store ............................................................................................................... 64
I.3.6 Security groups (comparable to firewall) ....................................................................................... 64
I.3.7 Storage Infrastructure .................................................................................................................... 66
I.3.8 Database services ........................................................................................................................... 67
I.4 Cloud architecture example ...........................................................................................68
2. The Linux Operating System ..............................................................................................69
2.1 Unix: the standard operating system (OS)....................................................................................... 69
2.2 Linux: a Unix-like OS ....................................................................................................................... 70
2.3 Linux command line instructions (file manipulation) ..................................................................... 74
2.4 jq .................................................................................................................................................... 77