Week 1: Introduction, file formats, Python for data engineering
● What is a data pipeline? When is a data pipeline expected to finish?
Which other technical requirements does a data engineer have to
ensure?
○ What is a data pipeline:
■ A data pipeline is a series of data processing steps. It consists
of three key elements:
● Data source(s).
● Processing step(s).
● Destination: data warehouse.
■ Data comes from different sources within the organization
■ Data is extracted and put in a central repository in a structured
format via ETL (extract/transform/load)
■ A data pipeline can contain machine learning models
■ Data processing is either
● real-time: online/streaming
● periodic (e.g. once per hour or day): offline/batch
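The three elements above (source, processing steps, destination) can be sketched as a tiny batch ETL job. This is only a minimal illustration: the sample data, the function names and the `revenue` table are assumptions made for the example, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from files, APIs or databases.
RAW_SALES_CSV = """order_id,product,amount
1,book,12.50
2,pen,1.20
3,book,12.50
"""


def extract(text):
    """Extract: read rows from the source (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert amounts to numbers and sum the revenue per product."""
    totals = {}
    for row in rows:
        totals[row["product"]] = totals.get(row["product"], 0.0) + float(row["amount"])
    return totals


def load(totals, connection):
    """Load: write the structured result into the destination table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS revenue (product TEXT PRIMARY KEY, total REAL)"
    )
    connection.executemany("INSERT OR REPLACE INTO revenue VALUES (?, ?)", totals.items())
    connection.commit()


def run_pipeline(connection):
    """Run one batch: extract, transform, load, then read back the result."""
    load(transform(extract(RAW_SALES_CSV)), connection)
    return connection.execute(
        "SELECT product, total FROM revenue ORDER BY product"
    ).fetchall()


warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
result = run_pipeline(warehouse)
print(result)
```

In a batch setting, `run_pipeline` would be triggered on a schedule (for example once per day); a streaming pipeline would instead process each record as it arrives.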
○ When is a data pipeline expected to finish? (The answer to this
question is our own reasoning, because we do not think the lecturer
covered this during the classes.)
■ A data pipeline needs to be updated constantly and must be
available at all times to support the business processes of the
organization. Therefore, a data pipeline is only expected to
finish if a better data pipeline is implemented or if the
business processes (which this data pipeline supports) cease
to exist.
■ Real-time := online/streaming processing (link week 8)
● E.g. a user visits Dreamland: the products shown on the
page are retrieved in real time; a query goes to the
database and the result comes back immediately
■ Once per hour/day := offline/batch processing (link week 8)
○ Data engineer:
■ A data engineer is responsible for implementing the components
needed to manage the data flow, so that data scientists can do
their analysis and gain the necessary insights
■ A data engineer ensures that processing is:
● scalable: supports a huge number of users (link with
distributed processing)
● reliable/available: minimal downtime and operationally robust
(back-ups, online applications available 24/7)
● maintainable: supports continuous change (software
and hardware updates)
● We saw three different data models for representing data. Name and
provide a short summary of each data model.
○ The relational model:
■ Consists of tables with rows (or tuples/records) and columns
■ Each column contains a primitive value such as a string, integer,
float or date
■ Two types of tables:
● Entities, e.g. persons, groups, objects
● Relations between entities, e.g. part-of, has-a, has-many,
linked-to
■ Each table can be saved as a Comma-Separated Values (CSV)
file
■ Strengths:
● structured
● schema checking
● natural model for batch processing
● flexible queries
■ Weaknesses:
● static and less flexible schema
● joins = necessary evil (they are complex)
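To make the relational model concrete, here is a small sketch using Python's built-in `sqlite3` module. The `student`, `course` and `enrolled` tables and their contents are invented for the illustration: two entity tables, one relation table, a join to combine them, and a rejected insert to show schema checking.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Entity tables
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Relation table: links students to courses (many-to-many)
CREATE TABLE enrolled (
    student_id INTEGER NOT NULL REFERENCES student(id),
    course_id INTEGER NOT NULL REFERENCES course(id),
    grade REAL
);
INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO course VALUES (10, 'Data Engineering');
INSERT INTO enrolled VALUES (1, 10, 16.0), (2, 10, 12.0);
""")

# Flexible queries: the joins stitch the entity and relation tables together.
rows = db.execute("""
    SELECT student.name, course.title, enrolled.grade
    FROM enrolled
    JOIN student ON student.id = enrolled.student_id
    JOIN course ON course.id = enrolled.course_id
    ORDER BY student.name
""").fetchall()
print(rows)

# Schema checking: a row that violates the schema (name is NOT NULL) is rejected.
try:
    db.execute("INSERT INTO student VALUES (3, NULL)")
    schema_rejected = False
except sqlite3.IntegrityError:
    schema_rejected = True
print(schema_rejected)
```

Note that answering even this simple question already needs two joins, which is the "necessary evil" listed under the weaknesses.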
○ The document-oriented model:
■ Consists of keys and documents, that is, each key is associated
with one document
■ Document is a tree containing:
● Primitive values
● Nested entities
● One-to-many relations
■ Each document can be stored (and transferred) in JSON or XML
■ Strengths:
● structured
● flexible: dynamic schema checking
● natural model for tree-structured data
● performance
■ Weaknesses:
● no static schema checking
● less flexible queries
● many intra-document relations
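A document store can be pictured as a mapping from keys to tree-shaped documents. The sketch below uses a plain Python dictionary as a stand-in for such a store; the keys, field names and values are hypothetical and only meant to show nested entities, a one-to-many relation and the absence of a fixed schema.

```python
import json

# Hypothetical key -> document store; each document is a tree.
documents = {
    "student:1": {
        "name": "Ann",                      # primitive value
        "address": {"city": "Antwerp"},     # nested entity
        "grades": [                         # one-to-many relation
            {"course": "Data Engineering", "grade": 16},
            {"course": "Statistics", "grade": 14},
        ],
    },
    # No static schema: this document simply has no address.
    "student:2": {"name": "Bob", "grades": []},
}


def average_grade(document):
    """Return the mean grade of one document, or None when it has no grades."""
    grades = [entry["grade"] for entry in document.get("grades", [])]
    return sum(grades) / len(grades) if grades else None


ann = documents["student:1"]
ann_average = average_grade(ann)
serialized = json.dumps(ann)  # the whole tree is stored or transferred as one JSON text
print(ann_average)
```

Everything about one student is read with a single key lookup and no join, which is where the performance strength comes from. Relating Ann to other documents would need extra lookups in application code.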
○ The graph-oriented model:
■ Consists of nodes and edges
■ A node is an instance of an entity and has a unique ID
■ An edge is a relation between two nodes and has a unique ID
■ Nodes and edges have named properties with primitive values
■ Strengths:
● structured
● flexible: schema can be easily changed
● natural model for domains where everything is connected with
each other, e.g. social networks
● variable number of joins
■ Weaknesses:
● no static schema checking
● used less in industry (academic model)
● when used in domains where everything is connected through
everything (not really a weakness, said Len)
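The graph model and its "variable number of joins" strength can be sketched with two dictionaries, one for nodes and one for edges, each entry having a unique ID and named properties. All IDs, names and the `reachable_within` helper are assumptions made for this illustration; a real graph database would provide its own query language for the traversal.

```python
from collections import deque

# Nodes: unique ID -> named properties with primitive values.
nodes = {
    "n1": {"label": "Person", "name": "Ann"},
    "n2": {"label": "Person", "name": "Bob"},
    "n3": {"label": "Person", "name": "Cem"},
    "n4": {"label": "Person", "name": "Dee"},
}
# Edges: unique ID -> the two connected nodes plus named properties.
edges = {
    "e1": {"source": "n1", "target": "n2", "type": "FRIEND_OF", "since": 2019},
    "e2": {"source": "n2", "target": "n3", "type": "FRIEND_OF", "since": 2020},
    "e3": {"source": "n3", "target": "n4", "type": "FRIEND_OF", "since": 2021},
}


def reachable_within(start, max_hops):
    """Breadth-first search: map each node reachable in at most max_hops edges to its distance."""
    neighbours = {node_id: set() for node_id in nodes}
    for edge in edges.values():
        # Friendship is treated as undirected here, so follow edges both ways.
        neighbours[edge["source"]].add(edge["target"])
        neighbours[edge["target"]].add(edge["source"])
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in neighbours[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


friends_of_friends = reachable_within("n1", 2)
print(friends_of_friends)
```

Changing `max_hops` changes how many relations are followed. In a relational database the same question would need a different number of joins for each hop count.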
● What are the strengths and weaknesses of the relational model versus the
document-oriented model? Which model would you prefer?
○ Relational model:
■ Strengths:
● structured
● schema checking
● natural model for batch processing
● flexible queries
■ Weaknesses:
● static and less flexible schema
● joins = necessary evil
○ Document-oriented model:
■ Strengths:
● structured
● flexible
● natural model (when data is tree-structured with few
intra-document (or many-to-many) relations)
● performance
■ Weaknesses:
● no static schema checking
● less flexible queries
● many intra-document relations
○ Which model would you prefer?
■ Each model is widely used for different purposes; there is no
one-size-fits-all solution!
■ The decision depends on the domain, that is, the structure of the data
and the type of application
■ Mixed systems are available; for instance, JSON columns are
supported in most relational databases these days.
● Which file formats are used for storing and communicating data?
Provide two short examples in JSON and XML for storing student
grades.
○ CSV = Comma-Separated Values:
■ A plain text format
■ Represents a single table in the relational data model
■ Values can be surrounded by double quotes (" ")
■ Very commonly used for batch processing and for
exporting/importing larger amounts of data
■ Easy to partition, e.g. 2020-10-01_sales.csv,
2020-10-02_sales.csv (= one file of sales data per day)
■ Can be easily compressed using zip
⇒ CSV is not really used for communicating data, so for this question
you probably only need to give JSON and XML
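The quoting and partitioning points for CSV can be illustrated with Python's standard `csv` module. The sales rows and the `partition_by_date` helper are made up for this sketch, and the partitions are kept in memory as strings rather than written to disk.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sales records; note the comma inside one product name.
sales = [
    {"date": "2020-10-01", "product": "book", "amount": "12.50"},
    {"date": "2020-10-01", "product": "pen, blue", "amount": "1.20"},
    {"date": "2020-10-02", "product": "book", "amount": "12.50"},
]


def partition_by_date(rows):
    """Return {file_name: csv_text} with one CSV partition per date."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["date"]].append(row)
    partitions = {}
    for date, day_rows in grouped.items():
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=["date", "product", "amount"], lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(day_rows)
        partitions[f"{date}_sales.csv"] = buffer.getvalue()
    return partitions


partitions = partition_by_date(sales)
print(sorted(partitions))
print(partitions["2020-10-01_sales.csv"])

# Reading a partition back: the quoted value is restored as a single field.
first_day = list(csv.DictReader(io.StringIO(partitions["2020-10-01_sales.csv"])))
```

The writer adds quotes only around values that contain the separator, which is why `pen, blue` survives a round trip as a single field.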
○ JSON = JavaScript Object Notation:
■ A plain text format
■ Syntax closely resembles data literals in Python and JavaScript
■ Represents a single tree of data in the document-oriented model
■ Makes use of arrays and dictionaries (objects)
■ Common format for sharing data between client (browser)
and server or communicating data between any two
applications / services
■ Also used for configuration of applications / services
■ Typically a single JSON document is small, but NoSQL
databases such as MongoDB store millions of documents
with a unique ID for each document
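As a possible answer to the "student grades" part of the question, the sketch below builds a small JSON document with Python's standard `json` module. The student ID, names, courses and grades are invented for the example.

```python
import json

# Hypothetical student record as a Python dictionary (a tree of data).
student_grades = {
    "student_id": "s001",
    "name": "Ann",
    "grades": [
        {"course": "Data Engineering", "grade": 16},
        {"course": "Statistics", "grade": 14},
    ],
    "graduated": False,  # becomes false in JSON
    "thesis": None,      # becomes null in JSON
}

json_text = json.dumps(student_grades, indent=2)
print(json_text)

# Parsing the text gives back an equal Python structure.
restored = json.loads(json_text)
```

The printed text looks almost identical to the Python literal; the visible differences are `false` and `null` instead of `False` and `None`.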
○ XML = eXtensible Markup Language:
■ Represents single tree of data in document-oriented model
■ Common format for sharing data between client (browser)
and server or communicating data between any two
applications / services
■ Instead of arrays and dictionaries, it uses tags (<>) with
attributes
■ For communication and configuration of applications /
services
■ XHTML, for formatting web-pages, is a type of XML
■ (As the name suggests XML is not really a format, but a
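For the XML half of the "student grades" question, a comparable sketch can be written with Python's standard `xml.etree.ElementTree` module. The tag names, attribute names and values are assumptions chosen for the illustration; XML itself does not prescribe them.

```python
import xml.etree.ElementTree as ET

# Build the tree: tags with attributes instead of dictionaries and arrays.
student = ET.Element("student", attrib={"id": "s001"})
ET.SubElement(student, "name").text = "Ann"
grades = ET.SubElement(student, "grades")
for course, value in [("Data Engineering", 16), ("Statistics", 14)]:
    grade = ET.SubElement(grades, "grade", attrib={"course": course})
    grade.text = str(value)  # element text is always a string in XML

xml_text = ET.tostring(student, encoding="unicode")
print(xml_text)

# Parse the text again and read the grades back as integers.
parsed = ET.fromstring(xml_text)
grade_by_course = {g.get("course"): int(g.text) for g in parsed.iter("grade")}
print(grade_by_course)
```

Unlike JSON, XML has no built-in number type, so the grades have to be converted back with `int` after parsing.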