This document contains written-out answers to more than 100 questions for the course Data Engineering, taught by Len Feremans.

Preview: 4 of 48 pages

  • 3 June 2021
  • 48 pages
  • 2020/2021
  • Summary
Seller: arnoverlinden2014
Exam Questions Data Engineering

Week 1: Introduction, file formats, python for data engineering
● What is a data pipeline? When is a data pipeline expected to finish?
Which other technical requirements does a data engineer ensure?

○ What is a data pipeline?
■ A data pipeline is a series of data processing steps. It consists
of three key elements:
● Data source(s).
● Processing step(s).
● Destination: data warehouse.
■ An organization has different data sources
■ Data is extracted and put into a central repository in a structured
format via ETL (extract/transform/load)
■ A data pipeline can contain machine learning models
■ Data processing is either
● real-time: online/streaming
● once per day: offline/batch
○ When is a data pipeline expected to finish? (This answer reflects
our own thoughts, because we think he did not say anything about
this during classes.)
■ A data pipeline needs to be updated constantly and must be
available at all times to support the business processes of the
organization. Therefore, a data pipeline is only expected to
finish if a better data pipeline is implemented or if the
business processes (which this data pipeline supports) cease
to exist.
■ Real-time := online/streaming processing (link week 8)
● E.g. a user goes to Dreamland: the products shown on
the page are real-time; a query goes to the database
and the result comes back immediately
■ Once per hour/day := offline/batch processing (link week 8)
○ Data engineer:
■ data engineer is responsible for implementing necessary
components for managing the data flow to enable data
scientists to do analysis and gain necessary insights
■ data engineer ensures processing is:
● scalable: supports a huge number of users (link with
distributed processing)
● reliable/available: minimal downtime and operationally robust
(back-ups, and online applications available 24/7)
● maintainable: supports continuous change (software
and hardware updates)
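
The three key elements above (source, processing step, destination) can be sketched as a tiny batch ETL job. This is only a minimal illustration, not something shown in class: the sales data, the table name daily_sales and the function names are made up, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file, an API or another database.
RAW_SALES_CSV = """date,product,amount
2020-10-01,book,12.50
2020-10-01,toy,7.00
2020-10-02,book,3.50
"""


def extract(text):
    """Data source: read raw rows from CSV text."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Processing step: convert types and sum the amount per day."""
    totals = {}
    for row in rows:
        totals[row["date"]] = totals.get(row["date"], 0.0) + float(row["amount"])
    return sorted(totals.items())


def load(daily_totals, connection):
    """Destination: write the result into a table of the central repository."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales (date TEXT PRIMARY KEY, total REAL)"
    )
    # INSERT OR REPLACE makes a re-run of the batch overwrite instead of duplicate.
    connection.executemany(
        "INSERT OR REPLACE INTO daily_sales VALUES (?, ?)", daily_totals
    )
    connection.commit()


def run_pipeline(text, connection):
    load(transform(extract(text)), connection)


warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
run_pipeline(RAW_SALES_CSV, warehouse)
result = warehouse.execute(
    "SELECT date, total FROM daily_sales ORDER BY date"
).fetchall()
print(result)
```

Running the same batch twice leaves the table unchanged, which is one small way to make an offline pipeline more reliable.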

● We saw three different data models for representing data. Name and
provide a short summary of each data model.
○ The relational model:
■ Consists of tables and rows (or tuples/records)
■ Each column contains a primitive value such as a string, integer,
float or date
■ Two types of tables:
● Entities, e.g. persons, groups, objects
● Relations between entities, e.g. part-of, has-a, has-many,
linked-to
■ Each table can be saved as a Comma-Separated Values (CSV)
file
Strengths                          | Weaknesses
structured                         | static and less flexible schema
schema checking                    | joins = necessary evil (they are complex)
natural model for batch processing |
flexible queries                   |

○ The document-oriented model:
■ Consists of keys and documents, that is, each key is associated
with one document
■ Document is a tree containing:
● Primitive values
● Nested entities
● One-to-many relations
■ Each document can be stored (and transferred) in JSON or XML
Strengths                              | Weaknesses
structured                             | no static schema checking
flexible: dynamic schema checking      | less flexible queries
natural model for tree-structured data | many intra-document relations
performance                            |
○ The graph-oriented model:
■ Consists of nodes and edges
■ A node is an instance of an entity and has a unique ID
■ An edge is a relation between two nodes and has a unique ID
■ A node and edge have named properties with a primitive value
Strengths | Weaknesses
structured | no static schema checking
flexible: schema can be easily changed | used less in industry (academic model)
natural model when everything is connected with each other, e.g. social networks | used in domains where everything is connected through everything (not really a weakness, said Len)
variable number of joins |
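
To make the three models concrete, here is a small sketch that stores the same student grades in each of them using plain Python structures. The students, courses, grades, IDs and function names are invented for illustration; a real system would use a relational, document or graph database instead.

```python
# Relational model: flat rows with primitive values; relations go through IDs.
students = [(1, "An"), (2, "Bob")]  # (student_id, name)
grades = [  # (student_id, course, grade)
    (1, "Data Engineering", 15),
    (1, "Databases", 12),
    (2, "Data Engineering", 9),
]


def grades_relational(name):
    # A join between the two tables on student_id.
    return [
        (course, grade)
        for student_id, student_name in students
        for grade_student_id, course, grade in grades
        if student_id == grade_student_id and student_name == name
    ]


# Document-oriented model: each key maps to one tree with nested entities.
documents = {
    "student:1": {
        "name": "An",
        "grades": [
            {"course": "Data Engineering", "grade": 15},
            {"course": "Databases", "grade": 12},
        ],
    },
    "student:2": {
        "name": "Bob",
        "grades": [{"course": "Data Engineering", "grade": 9}],
    },
}


def grades_document(key):
    # No join needed: the grades are nested inside the document.
    return [(g["course"], g["grade"]) for g in documents[key]["grades"]]


# Graph-oriented model: nodes and edges, each with a unique ID and named properties.
nodes = {
    "n1": {"label": "Student", "name": "An"},
    "n2": {"label": "Student", "name": "Bob"},
    "n3": {"label": "Course", "title": "Data Engineering"},
    "n4": {"label": "Course", "title": "Databases"},
}
edges = {
    "e1": {"from": "n1", "to": "n3", "type": "TOOK", "grade": 15},
    "e2": {"from": "n1", "to": "n4", "type": "TOOK", "grade": 12},
    "e3": {"from": "n2", "to": "n3", "type": "TOOK", "grade": 9},
}


def grades_graph(student_node_id):
    # Follow the outgoing edges of one node.
    return [
        (nodes[edge["to"]]["title"], edge["grade"])
        for edge in edges.values()
        if edge["from"] == student_node_id
    ]


print(grades_relational("An"))
print(grades_document("student:1"))
print(grades_graph("n1"))
```

All three calls return the same two grades; what differs is where the relation lives: in a join, in the nesting, or in the edges.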

● What are the strengths and weaknesses of the relational model versus the
document-oriented model? Which model would you prefer?


Relational model
Strengths                          | Weaknesses
structured                         | static and less flexible schema
schema checking                    | joins = necessary evil
natural model for batch processing |
flexible queries                   |

Document-oriented model
Strengths | Weaknesses
structured | no static schema checking
flexible | less flexible queries
natural model (when data is tree-structured with few intra-document (or many-to-many) relations) | many intra-document relations
performance |

○ Which model would you prefer?
■ Each model is widely used for different purposes; there is no
one-size-fits-all solution!
■ The decision depends on the domain, that is, the structure of the data and
the type of application
■ Mixed systems are available; for instance, JSON columns are
supported in most relational databases these days.
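
A rough sketch of such a mixed system: a relational table in which one column holds a JSON document. To stay self-contained it stores the JSON as plain text in SQLite and parses it in Python, which is a simplification; databases with a real JSON column type can also query inside the document. The table, column names and data are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    " student_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"        # fixed, schema-checked column
    " grades_json TEXT NOT NULL"  # flexible, document-style column
    ")"
)
conn.execute(
    "INSERT INTO students VALUES (?, ?, ?)",
    (1, "An", json.dumps({"Data Engineering": 15, "Databases": 12})),
)
conn.commit()

# The relational part is queried with SQL, the document part is decoded afterwards.
name, grades_json = conn.execute(
    "SELECT name, grades_json FROM students WHERE student_id = ?", (1,)
).fetchone()
grades = json.loads(grades_json)
print(name, grades)
```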

● Which file formats are used for storing and communicating data?
Provide two short examples in JSON and XML for storing student
grades.
○ CSV = Comma-Separated Values:
■ A plain text format
■ Represents a single table in the relational data model
■ Values can be surrounded by quotation marks (" ")
■ Very commonly used for batch processing and for exporting/importing
larger amounts of data
■ Easy to partition, e.g. 2020-10-01_sales.csv,
2020-10-02_sales.csv (= one file with the sales data of each day)
■ Can be easily compressed using zip
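
The quoting and partitioning points can be sketched with Python's csv module. The sales rows and function names are invented; the file names follow the date-prefixed pattern mentioned above, and nothing is written to disk (the "files" are kept as strings in a dictionary).

```python
import csv
import io
from collections import defaultdict

# Hypothetical sales rows; the product name contains a comma on purpose.
sales = [
    {"date": "2020-10-01", "product": "Book, hardcover", "amount": "12.50"},
    {"date": "2020-10-01", "product": "Toy", "amount": "7.00"},
    {"date": "2020-10-02", "product": "Book, hardcover", "amount": "3.50"},
]


def partition_by_date(rows):
    """Group rows into one partition (future file name) per date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{row['date']}_sales.csv"].append(row)
    return dict(partitions)


def to_csv_text(rows):
    """Serialize rows as CSV; values containing a comma get quoted automatically."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["date", "product", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


files = {name: to_csv_text(rows) for name, rows in partition_by_date(sales).items()}
print(files["2020-10-01_sales.csv"])
```

Because each partition is an independent file, a batch job can process or reload a single day without touching the others.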

⇒ CSV is not really used for communicating data, so we think that for this
question you only need to give JSON and XML

○ JSON = JavaScript Object Notation:
■ A plain text format
■ Nearly the same syntax as data literals in Python and JavaScript
■ Represents a single tree of data in the document-oriented model
■ Makes use of arrays and dictionaries
■ Common format for sharing data between client (browser)
and server or communicating data between any two
applications / services
■ Also used for configuration of applications / services
■ Typically a single JSON document is small, but NoSQL
databases such as MongoDB store millions of documents
with a unique ID for each document
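
The question asks for two short examples for storing student grades. The sketch below is our own attempt, not an answer given in class: it builds one student as a Python dictionary, prints it as JSON, and then prints the same tree as XML for comparison. The student, courses, grades and tag/attribute names are all made up.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical student record: a dictionary with a nested array of grades.
student = {
    "student_id": 1,
    "name": "An",
    "grades": [
        {"course": "Data Engineering", "grade": 15},
        {"course": "Databases", "grade": 12},
    ],
}

# JSON: the dictionary and array map directly onto JSON objects and arrays.
json_text = json.dumps(student, indent=2)
print(json_text)

# XML: the same tree, expressed with tags and attributes instead.
root = ET.Element("student", attrib={"id": str(student["student_id"])})
ET.SubElement(root, "name").text = student["name"]
grades_element = ET.SubElement(root, "grades")
for entry in student["grades"]:
    grade_element = ET.SubElement(
        grades_element, "grade", attrib={"course": entry["course"]}
    )
    grade_element.text = str(entry["grade"])

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Note that JSON keeps the grade as a number, while in XML every value is text and has to be converted back by the reader.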

○ XML = eXtensible Markup Language:
■ Represents a single tree of data in the document-oriented model
■ Common format for sharing data between client (browser)
and server or communicating data between any two
applications / services
■ Instead of arrays and dictionaries, it uses tags (<...>) with
attributes
■ Used for communication and configuration of applications /
services
■ XHTML, for formatting web-pages, is a type of XML
■ (As the name suggests XML is not really a format, but a