Week 1: Introduction, file formats, Python for data engineering
● What is a data pipeline? When is a data pipeline expected to finish?
Which other technical requirements does a data engineer ensure?
○ What is a data pipeline:
■ A data pipeline is a series of data processing steps. It consists
of three key elements:
● Data source(s).
● Processing step(s).
● Destination: data warehouse.
■ Different data sources exist within an organization
■ Data is extracted and put in a central repository in a structured
format via ETL (extract/transform/load)
■ A data pipeline can contain machine learning models
■ Data processing is either
● real-time: online/streaming
● periodic (e.g. once per hour/day): offline/batch
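The three elements above (source, processing step, destination) can be sketched as a tiny ETL run. This is only an illustration under assumptions: the sales data, the `daily_sales` table and all function names are made up, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data: in a real pipeline this would come from files,
# APIs, or operational databases.
RAW_SALES_CSV = """date,product,amount
2020-10-01,book,12.50
2020-10-01,pen,1.20
2020-10-02,book,12.50
"""


def extract(raw_text):
    """Data source: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Processing step: convert amounts to floats and total them per day."""
    totals = {}
    for row in rows:
        totals[row["date"]] = totals.get(row["date"], 0.0) + float(row["amount"])
    return sorted(totals.items())


def load(daily_totals, connection):
    """Destination: write the result into a structured table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales (date TEXT PRIMARY KEY, total REAL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO daily_sales VALUES (?, ?)", daily_totals
    )
    connection.commit()


def run_pipeline(raw_text, connection):
    """Run extract, transform and load, then return the stored rows."""
    load(transform(extract(raw_text)), connection)
    return connection.execute(
        "SELECT date, total FROM daily_sales ORDER BY date"
    ).fetchall()
```

Calling `run_pipeline(RAW_SALES_CSV, sqlite3.connect(":memory:"))` returns one row per date with the summed amount.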
○ When is a data pipeline expected to finish? (this answer reflects
our own thoughts, because we think the lecturer did not say
anything about this during class)
■ A data pipeline needs to be updated constantly and must be
available at all times to support the business processes of the
organization. Therefore, a data pipeline is only expected to
finish if a better data pipeline is implemented or if the
business processes (which this data pipeline supports) cease
to exist.
■ Real-time := online/streaming processing (link week 8)
● E.g. a user goes to Dreamland: the products they see on
the page are real-time; a query goes to the
database and they get the result immediately
■ Once per hour/day := offline/batch processing (link week 8)
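The difference between the two processing modes can be sketched with the same computation done twice. The page-view events and function names below are hypothetical; the point is only that the batch version produces one result after seeing all data, while the streaming version has an up-to-date result after every event.

```python
from collections import defaultdict

# Hypothetical page-view events as (user, page) pairs.
EVENTS = [("ann", "/toys"), ("bob", "/toys"), ("ann", "/books")]


def batch_counts(events):
    """Offline/batch: process the complete collected data set in one run."""
    counts = defaultdict(int)
    for _user, page in events:
        counts[page] += 1
    return dict(counts)


def streaming_counts(event_stream):
    """Online/streaming: update the result after every single event."""
    counts = defaultdict(int)
    for _user, page in event_stream:
        counts[page] += 1
        yield dict(counts)  # an up-to-date result is available immediately


# One snapshot per incoming event; the last one equals the batch result.
snapshots = list(streaming_counts(iter(EVENTS)))
```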
○ Data engineer:
■ A data engineer is responsible for implementing the necessary
components for managing the data flow, to enable data
scientists to do analysis and gain the necessary insights
■ A data engineer ensures processing is:
● scalable: supports a huge number of users (link with
distributed processing)
● reliable/available: minimal downtime and operationally robust
(back-ups, and online applications available 24/7)
● maintainable: supports continuous change (software
and hardware updates)
● We saw three different data models for representing data. Name and
provide a short summary of each data model.
○ The relational model:
■ Consists of tables and rows (or tuples/records)
■ Each column contains a primitive value such as a string, integer,
float or date
■ Two types of tables:
● Entities, e.g. persons, groups, objects
● Relations between entities, e.g. part-of, has-a, has-many,
linked-to
■ Each table can be saved as a Comma-Separated Values (CSV)
file
■ Strengths:
● structured
● schema checking
● natural model for batch processing
● flexible queries
■ Weaknesses:
● static and less flexible schema
● joins = necessary evil (they are complex)
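A minimal sketch of the relational model, using Python's built-in `sqlite3` module. The `person`, `club` and `member_of` tables and their contents are hypothetical. It shows two entity tables, one relation table, a join, and the schema check rejecting an invalid row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    -- Entity tables
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE club   (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Relation table: which person is a member of which club
    CREATE TABLE member_of (
        person_id INTEGER NOT NULL REFERENCES person(id),
        club_id   INTEGER NOT NULL REFERENCES club(id)
    );
""")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO club VALUES (?, ?)", [(10, "Chess"), (11, "Choir")])
conn.executemany("INSERT INTO member_of VALUES (?, ?)", [(1, 10), (1, 11), (2, 10)])

# A join combines the entity and relation tables to answer a flexible query.
chess_members = conn.execute("""
    SELECT person.name
    FROM person
    JOIN member_of ON member_of.person_id = person.id
    JOIN club      ON club.id = member_of.club_id
    WHERE club.title = ?
    ORDER BY person.name
""", ("Chess",)).fetchall()

# Schema checking: a row that violates the schema (name is NULL) is rejected.
try:
    conn.execute("INSERT INTO person VALUES (?, ?)", (3, None))
    schema_violation_rejected = False
except sqlite3.IntegrityError:
    schema_violation_rejected = True
```

Even this simple membership question needs two joins, which illustrates why joins are listed as a weakness.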
○ The document-oriented model:
■ Consists of keys and documents, that is, each key is associated
with one document
■ A document is a tree containing:
● Primitive values
● Nested entities
● One-to-many relations
■ Each document can be stored (and transferred) in JSON or XML
■ Strengths:
● structured
● flexible: dynamic schema checking
● natural model for tree-structured data
● performance
■ Weaknesses:
● no static schema checking
● less flexible queries
● many intra-document relations
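The document-oriented model can be sketched with a plain Python dictionary acting as the key-to-document store. The order data below is hypothetical. One document holds primitive values, a nested entity and a one-to-many relation, and it round-trips through JSON.

```python
import json

# Key -> document store, sketched as a plain dictionary (hypothetical data).
documents = {
    "order-1001": {
        "customer": {"name": "Ann", "city": "Ghent"},  # nested entity
        "paid": True,                                   # primitive value
        "items": [                                      # one-to-many relation
            {"product": "book", "quantity": 2},
            {"product": "pen", "quantity": 5},
        ],
    }
}

# The whole tree is retrieved with a single key lookup, without joins.
order = documents["order-1001"]
total_quantity = sum(item["quantity"] for item in order["items"])

# A document can be serialized to JSON text for storage or transfer,
# and parsed back into an equal tree.
as_text = json.dumps(order)
round_tripped = json.loads(as_text)
```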
○ The graph-oriented model:
■ Consists of nodes and edges
■ A node is an instance of an entity and has a unique ID
■ An edge is a relation between two nodes and has a unique ID
■ Nodes and edges have named properties with primitive values
■ Strengths:
● structured
● flexible: schema can be easily changed
● natural model when everything is connected with each other,
e.g. social networks
● variable number of joins
■ Weaknesses:
● no static schema checking
● used less in industry (academic model)
● used in domains where everything is connected through
everything (not really a weakness, according to Len)
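A small sketch of the graph-oriented model in plain Python. The social network, the IDs and the helper names are all hypothetical. Nodes and edges each have a unique ID and named properties, and the number of hops to follow is a parameter, which is one way to read "variable number of joins".

```python
from collections import deque

# Nodes: unique ID -> named properties (hypothetical social network).
nodes = {
    "n1": {"name": "Ann"},
    "n2": {"name": "Bob"},
    "n3": {"name": "Cid"},
    "n4": {"name": "Dee"},
}

# Edges: unique ID -> (source node, target node, named properties).
edges = {
    "e1": ("n1", "n2", {"type": "friend"}),
    "e2": ("n2", "n3", {"type": "friend"}),
    "e3": ("n3", "n4", {"type": "friend"}),
}


def neighbours(node_id):
    """Return all nodes directly connected to node_id (edges treated as undirected)."""
    result = set()
    for source, target, _props in edges.values():
        if source == node_id:
            result.add(target)
        elif target == node_id:
            result.add(source)
    return result


def within_hops(start, max_hops):
    """Breadth-first search: names of nodes reachable in at most max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node_id, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the requested number of hops
        for other in neighbours(node_id):
            if other not in seen:
                seen.add(other)
                queue.append((other, depth + 1))
    seen.discard(start)
    return sorted(nodes[node_id]["name"] for node_id in seen)
```

For example, `within_hops("n1", 2)` answers "friends and friends of friends of Ann". A relational query would need a different number of joins for each hop count.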
● What are the strengths and weaknesses of the relational model versus the
document-oriented model? Which model would you prefer?
○ Relational model:
■ Strengths:
● structured
● schema checking
● natural model for batch processing
● flexible queries
■ Weaknesses:
● static and less flexible schema
● joins = necessary evil
○ Document-oriented model:
■ Strengths:
● structured
● flexible
● natural model (when data is tree-structured with few
intra-document (or many-to-many) relations)
● performance
■ Weaknesses:
● no static schema checking
● less flexible queries
● many intra-document relations
○ Which model would you prefer?
■ Each model is widely used for different purposes; there is no
one-size-fits-all solution!
■ The decision depends on the domain, that is, the structure of the
data and the type of application
■ Mixed systems are available; for instance, JSON columns are
supported in most relational databases these days.
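The idea of a mixed system can be sketched with `sqlite3` by keeping fixed relational columns next to one column that holds a JSON document as text. The `student` table and its contents are hypothetical. Native JSON column types (for example PostgreSQL's `jsonb`) additionally allow querying inside the document from SQL. This sketch does not rely on such features and simply parses the document in Python.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Fixed relational columns plus one free-form column holding a JSON document.
conn.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL, details TEXT)"
)

details = {"grades": [{"course": "Data Engineering", "grade": 15}], "erasmus": False}
conn.execute("INSERT INTO student VALUES (?, ?, ?)", (1, "Ann", json.dumps(details)))

# The relational part is queried with SQL; the document part is parsed afterwards.
name, raw_details = conn.execute(
    "SELECT name, details FROM student WHERE id = ?", (1,)
).fetchone()
first_grade = json.loads(raw_details)["grades"][0]["grade"]
```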
● Which file formats are used for storing and communicating data?
Provide two short examples in JSON and XML for storing student
grades.
○ CSV = Comma-Separated Values:
■ A plain text format
■ Represents a single table in the relational data model
■ Values can be surrounded by double quotation marks (" ")
■ Very commonly used for batch processing and for
exporting/importing large amounts of data
■ Easy to partition, e.g. 2020-10-01_sales.csv,
2020-10-02_sales.csv (= one file of sales data per day)
■ Can be easily compressed using zip
⇒ CSV is not really used for communicating data, so for this question you
probably only need to give JSON and XML
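The CSV points above (quoting and partitioning by date) can be sketched with Python's `csv` module. The sales rows are hypothetical, and in-memory buffers stand in for real files named like `2020-10-01_sales.csv`.

```python
import csv
import io

# Hypothetical sales rows; the second product name contains a comma.
rows = [
    {"date": "2020-10-01", "product": "book", "amount": "12.50"},
    {"date": "2020-10-02", "product": "pen, blue", "amount": "1.20"},
]

# Partitioning: one CSV "file" per date (in-memory buffers instead of files).
partitions = {}
for row in rows:
    name = f"{row['date']}_sales.csv"
    if name not in partitions:
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["date", "product", "amount"])
        writer.writeheader()
        partitions[name] = (buffer, writer)
    partitions[name][1].writerow(row)

# The writer surrounds the value containing a comma with quotation marks.
second_file_text = partitions["2020-10-02_sales.csv"][0].getvalue()

# Reading the text back restores the original values, including the comma.
read_back = list(csv.DictReader(io.StringIO(second_file_text)))
```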
○ JSON = JavaScript Object Notation:
■ A plain text format
■ Syntax closely resembles data literals in Python and JavaScript
■ Represents a single tree of data in the document-oriented model
■ Makes use of arrays and dictionaries
■ Common format for sharing data between client (browser)
and server or communicating data between any two
applications / services
■ For configuration of applications / services
■ Typically a single JSON document is small, but NoSQL
databases such as MongoDB store millions of documents
with a unique ID for each document
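As a possible JSON answer to the student-grades part of the question, here is a sketch using Python's `json` module. The student, courses and grades are made up, and the field names are only one reasonable choice.

```python
import json

# Hypothetical student-grades record; names, courses and grades are made up.
student = {
    "student_id": "s001",
    "name": "Ann",
    "grades": [
        {"course": "Data Engineering", "grade": 15},
        {"course": "Databases", "grade": 13},
    ],
}

# Dictionaries become JSON objects and lists become JSON arrays.
text = json.dumps(student, indent=2)
print(text)

# Parsing the text restores an equal Python structure.
parsed = json.loads(text)
average = sum(g["grade"] for g in parsed["grades"]) / len(parsed["grades"])
```

The printed text is the JSON document itself: an object with two primitive fields and an array of grade objects.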
○ XML = eXtensible Markup Language:
■ Represents a single tree of data in the document-oriented model
■ Common format for sharing data between client (browser)
and server or communicating data between any two
applications / services
■ Instead of arrays and dictionaries, it uses tags (<...>) with
attributes
■ For communication and configuration of applications /
services
■ XHTML, for formatting web pages, is a type of XML
■ (As the name suggests XML is not really a format, but a