Table of Contents
I. Week 1 ....................................................................................................................... 7
1. Welcome to data engineering ............................................................................................. 7
1.1 Overview: ......................................................................................................................................... 7
1.2 Pop quiz ............................................................................................................................................ 7
1.3 Differentiating between data engineering and data science? ......................................................... 7
1.4 About this course............................................................................................................................ 10
1.5 Exam ............................................................................................................................................... 12
2. File formats .......................................................................................................................12
2.1 Overview: ....................................................................................................................................... 12
2.2 Formats........................................................................................................................................... 13
a) Human readable formats .................................................................................................................... 13
b) Not human readable and compressed file formats ............................................................................ 18
2.5 When to use what? ........................................................................................................................ 20
3 Python concepts ................................................................................................................21
3.1 Overview......................................................................................................................................... 21
3.2 Programming paradigms ................................................................................................................ 21
3.3 Python functions are first-class objects .......................................................................................... 21
3.4 Anonymous lambda function ......................................................................................................... 21
3.5 Passing functions to other functions .............................................................................................. 22
3.6 Sorting elements of an iterable ...................................................................................................... 22
3.6.1 Sorting (Python indices start at 0) ............................................................................................. 22
3.7 Partial functions.............................................................................................................................. 23
3.8 Collections ...................................................................................................................................... 23
3.8.1 defaultdict ...................................................................................................................................... 23
3.8.2 Counter ........................................................................................................................................... 24
3.9 Map the elements of an iterable .................................................................................................... 24
3.10 Itertools: zip (iterate over multiple lists at the same time).................................................... 24
3.11 Itertools .......................................................................................................................................... 24
3.11.1 combinations (finds all possible combinations of the elements in a list,
depending on the parameters you pass) ................................................................................................... 24
3.11.2 permutations.............................................................................................................................. 25
3.12 One-line dot-product ...................................................................................................................... 25
3.13 Unicode .......................................................................................................................................... 25
3.14 Dates and times .............................................................................................................................. 27
II. Week 2 ..................................................................................................................... 28
1. Computer architecture and OS ...........................................................................................28
1.1 Basic computer architecture and operating systems ......................................................28
1.1.1 At the end of this lecture ................................................................................................................ 28
1.1.2 Why do I need to know about computer architecture? ................................................................. 28
1.1.3 The main components of a computer ............................................................................................ 28
1.1.4 The clock frequency (speed of CPU) and the architecture of the CPU influence the number of
instructions that can be executed per second............................................................................................. 29
1.1.4.1 Parallelism will have a bigger impact on your data processing.................................................. 29
1.1.4.2 Modern CPU packages contain multiple cores improving parallelism ....................................... 29
1.1.4.3 The speed gap between CPU and memory ................................................................................ 30
1.1.4.4 Caching to optimize your data flow ........................................................................................... 30
1.1.5 Disks: high capacity, very slow storage devices.............................................................................. 31
1.1.6 Hard Disk Drive (HDD) .................................................................................................................... 31
1.1.6.1 HDD latency................................................................................................................................ 32
1.1.7 Solid State Disks (=transistors) ....................................................................................................... 32
1.1.8 Scaling: vertical vs horizontal ......................................................................................................... 33
1.2 Operating system level ..................................................................................................34
1.3 What is an operating system (OS) .................................................................................................. 34
1.3.1 The operating system hides some of the hardware complexity..................................................... 34
1.4 Process management: .................................................................................................................... 35
1.4.1 Process Control Block (PCB)............................................................................................................ 36
1.4.2 Threads (exist within the same process) and concurrency ........................................................... 36
1.4.3 Scheduling ...................................................................................................................................... 37
1.5 Memory management and virtual memory ................................................................................... 37
1.6 Inter-process communication ........................................................................................................ 39
1.7 Input/Output management ............................................................................................................ 39
1.8 File systems as a way to organize files on (secondary) memory .................................................... 39
1.9 Directory structure ......................................................................................................................... 39
1.10 Distributed File Systems (DFS) ........................................................................................................ 40
1.11 Virtualization: virtual machine ....................................................................................................... 40
1.12 Containers: lightweight virtualization ............................................................................................ 41
2. Regular expressions ...........................................................................................................41
2.1 Overview......................................................................................................................................... 41
2.2 Regular expression = regex ............................................................................................................. 41
2.2.1 Extracting email addresses of all people registered for Data Engineering ..................................... 42
2.2.2 Applications .................................................................................................................................... 43
2.2.3 Regular expressions are like a mini-language where certain characters have a special meaning . 43
2.2.4 Searching for a literal...................................................................................................................... 44
2.2.5 Match any character with . ............................................................................................................. 44
2.2.5.1 Match a set of characters........................................................................................................... 44
2.2.5.2 Match a range of characters ...................................................................................................... 45
2.2.5.3 Negate a set of characters ......................................................................................................... 45
2.2.6 Some predefined set of characters ................................................................................................ 45
2.2.7 Repeat a pattern one or zero times (make it optional) .................................................................. 46
2.2.8 Repeat a pattern one or more times .............................................................................................. 46
2.2.9 Repeat a pattern exactly n times .................................................................................................... 46
2.2.10 Repeating operators are greedy ................................................................................................ 46
2.2.11 Capturing and non-capturing groups ......................................................................................... 46
2.2.12 Lookahead .................................................................................................................................. 47
2.2.13 Lookbehind................................................................................................................................. 47
2.2.14 Regexes gone wrong .................................................................................................................. 48
2.2.15 Put an ad on all URLs that contain the name of the telecom operator
2.2.16 Why [^t] ? It’s easy to miss edge cases, might not be perfect ................................................... 48
2.3 Concluding remarks ........................................................................................................................ 48
2.4 Extra................................................................................................................................................ 49
3. Computer networks ...........................................................................................................50
3.1 Based on computer networking: a top-down approach by Kurose and Ross................................. 50
3.1.1 The internet described in terms of its hardware components ....................................................... 50
3.1.2 Hosts and the client-server model ................................................................................................. 50
3.1.3 Protocol .......................................................................................................................................... 51
3.1.4 Packet Switching ............................................................................................................................ 51
3.1.5 The internet protocol stack ........................................................................................................... 52
3.1.6 Layered structure and protocols: the protocol stack ..................................................................... 52
3.2 Network applications...................................................................................................................... 53
3.2.1 HTTP – Hypertext Transfer Protocol ................................................................................... 54
3.2.1.1 The HTTP request message ........................................................................................................ 54
3.2.1.2 The HTTP headers: ..................................................................................................................... 55
3.2.1.3 The general structure of the HTTP request message................................................................ 55
3.2.1.4 The HTTP response message ...................................................................................................... 55
3.2.1.5 The general structure of the HTTP response message ............................................................. 56
3.2.1.6 If using non-secure HTTP (not HTTPS), one’s login and password are sent in clear text. ............. 56
3.2.1.7 HTTP uses the TCP transport layer protocol to send its messages ........................................... 56
3.2.1.8 Addressing processes ................................................................................................................. 57
3.2.2 DNS – Domain Name System.......................................................................................................... 57
3.2.2.1 Why distributed? ....................................................................................................................... 58
3.2.2.2 DNS client used by a web browser............................................................................................ 58
III. Week 3 ................................................................................................................. 59
1. Cloud services ....................................................................................................................59
I.1 Overview .......................................................................................................................59
I.2 Defining CS ....................................................................................................................59
I.2.1 What is the cloud? .......................................................................................................................... 59
I.2.2 Managing your cloud account ........................................................................................................ 59
I.2.3 The cloud stack ............................................................................................................................... 59
I.2.4 Examples of SaaS on AWS .............................................................................................................. 59
I.2.5 Advantages of cloud computing ..................................................................................................... 60
I.2.6 Geographic organization of AWS .................................................................................................... 60
I.2.7 Virtualization .................................................................................................................................. 60
I.2.8 Native and hosted virtualization .................................................................................................... 61
I.3 Core AWS services .........................................................................................................61
I.3.1 Virtual server hosting: AWS EC2 ..................................................................................................... 61
I.3.2 EC2 price models ............................................................................................................................ 61
I.3.3 VPC: Virtual Private Cloud............................................................................................................... 63
I.3.4 Identity and Access Management (IAM): Permissions and roles .......................................................... 63
I.3.5 EBS: Elastic Block Store ............................................................................................................... 64
I.3.6 Security groups (comparable to firewall) ....................................................................................... 64
I.3.7 Storage Infrastructure .................................................................................................................... 66
I.3.8 Database services ........................................................................................................................... 67
I.4 Cloud architecture example ...........................................................................................68
2. The Linux Operating System ..............................................................................................69
2.1 Unix: the standard operating system (OS)....................................................................................... 69
2.2 Linux: a Unix-like OS ....................................................................................................................... 70
2.3 Linux command line instructions (file manipulation) ..................................................................... 74
2.4 JQ .................................................................................................................................................... 77