Summary (17/20) DATA ENGINEERING: SOLVED EXAM QUESTIONS

This document contains the solved exam questions for Data Engineering in the Master in Digital Business Engineering, worked out in extended form. It is based on: lectures, intuition, and ChatGPT 4o (the answers to the questions were verified by ChatGPT 4o). Academic Year 2...

Uploaded on 5 September 2024
Number of pages: 99
Written in 2023/2024
Type: Summary

Seller: giorgibala

Academic Year: 2023 – 2024

University of Antwerp

SOLVED EXAM QUESTIONS
DATA ENGINEERING
prof. L. Feremans

THEORY

Introduction
1. What is a data pipeline? What are the different types of data processing and what is the role of the data
engineer in its development? Give an example of a data pipeline in e-commerce.

A data pipeline is a method in which raw data is extracted from various data sources (e.g., inventory
management system, salesforce system, Google reviews, …), transformed into a usable format, and then loaded
into a centralized, structured data repository (e.g., a data warehouse or data lake). Such a pipeline gives data
scientists a foundation for analyzing the data and turning it into valuable insights. The pipeline may itself
contain machine learning models.

Data engineers must:

- Ensure that processing, and thus the pipeline, is:

  - Scalable: able to support large amounts of data.

  - Reliable and available: with minimal downtime and operational robustness. This can be achieved
with multiple servers and an online copy to minimize downtime in case of issues.

  - Maintainable: it must support continuous changes.

- Implement components to manage the data pipeline:

  - ETL (Extract/Transform/Load): data is extracted from sources, transformed into a suitable format,
and then loaded into the repository (data warehouse/data lake).

  - ELT (Extract/Load/Transform): data is first extracted from sources and loaded into the data
warehouse, and only then transformed.

- Enable data scientists to perform analyses on the data to extract insights and value.
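The difference between ETL and ELT is only the order of the last two steps. The following is a minimal sketch, assuming an in-memory SQLite database stands in for the data warehouse; the `RAW_ORDERS` records, the table names, and the helper functions are hypothetical and chosen for illustration only.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    {"order_id": 1, "amount": "19.50", "country": "be"},
    {"order_id": 2, "amount": "5.25", "country": "NL"},
]


def transform(record):
    """Cast the amount to a number and normalise the country code."""
    return (record["order_id"], float(record["amount"]), record["country"].upper())


def run_etl(conn):
    """ETL: transform inside the pipeline, then load the clean rows."""
    conn.execute(
        "CREATE TABLE orders_etl (order_id INTEGER, amount REAL, country TEXT)"
    )
    clean_rows = [transform(r) for r in RAW_ORDERS]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", clean_rows)


def run_elt(conn):
    """ELT: load the raw rows first, then transform inside the repository with SQL."""
    conn.execute(
        "CREATE TABLE orders_raw (order_id INTEGER, amount TEXT, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders_raw VALUES (:order_id, :amount, :country)", RAW_ORDERS
    )
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT order_id, CAST(amount AS REAL) AS amount, UPPER(country) AS country "
        "FROM orders_raw"
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
```

Both variants end with the same clean table. With ELT the raw table also stays in the repository, so the transformation can be rerun or changed later without extracting the data again.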

Types of data processing:

1. Real-time processing: online processing where data is processed as soon as it arrives. Suitable for
environments like financial trading platforms or online gaming, where immediate data processing is
crucial for real-time decision-making.

2. Streaming (near real-time processing): data is processed almost immediately after it is generated,
event-based, suitable for monitoring and alert systems. Ideal for environments like social media
monitoring or sensor data analysis in IoT (Internet of Things) devices, where data needs to be
processed almost instantly to trigger alerts or updates.

3. Batch processing (offline processing): data is collected over a period and processed in batches,
suitable for reporting (e.g., hourly or daily reports).
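The contrast between batch and event-based processing can be sketched in a few lines. This is an illustrative sketch only: the `EVENTS` list, the one-hour window, and the alert threshold are assumptions, and a real streaming system would consume events from a message queue rather than from a list.

```python
from collections import defaultdict

# Hypothetical order events: (timestamp in seconds, product id, amount).
EVENTS = [(0, "A", 10.0), (30, "B", 5.0), (70, "A", 2.5), (3700, "B", 4.0)]


def process_batch(events, window_seconds=3600):
    """Batch: collect events over a period, then compute one total per window."""
    report = defaultdict(float)
    for timestamp, _product, amount in events:
        report[timestamp // window_seconds] += amount
    return dict(report)


def process_stream(events, alert_threshold=10.0):
    """Streaming: handle each event as it arrives and react to it right away."""
    for timestamp, product, amount in events:
        if amount >= alert_threshold:
            yield (timestamp, f"large order for product {product}")


hourly_revenue = process_batch(EVENTS)
alerts = list(process_stream(EVENTS))
```

The batch function only produces its result after all events in a window are available, which fits hourly or daily reports. The streaming function emits an alert per event, which fits monitoring and alerting.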

Background information:

During the transformation phase of the data pipeline, the data engineer will be concerned with:

- Aggregating the data
- Parsing the data (from one format to another)

Example in e-commerce: Personalized Product Recommendations

Sources of data: user clicks on the website, user-related information from the website, purchase history from
transactional databases, and customer reviews.

Data is then extracted from these sources and transformed in order to build customer profiles.
This implies:

- Parsing the data
- Aggregating the data

Finally, the data is loaded into a centralized data repository, i.e., a data warehouse/data lake (e.g.,
Amazon Redshift or Google BigQuery). This is an ETL pipeline; the alternative is ELT, where the last two
steps are reversed.
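The parsing and aggregating steps of this example can be sketched as follows. The click records, the field names `user` and `category`, and the shape of the resulting profile are all hypothetical; they only illustrate what "parse, then aggregate into a customer profile" could look like.

```python
import json
from collections import Counter, defaultdict

# Hypothetical raw click log: one JSON string per click.
RAW_CLICKS = [
    '{"user": "u1", "category": "shoes"}',
    '{"user": "u1", "category": "shoes"}',
    '{"user": "u1", "category": "bags"}',
    '{"user": "u2", "category": "books"}',
]


def parse(lines):
    """Parsing: convert each raw JSON string into a Python dictionary."""
    return [json.loads(line) for line in lines]


def aggregate(clicks):
    """Aggregating: summarise the clicks of each user into one profile."""
    counts = defaultdict(Counter)
    for click in clicks:
        counts[click["user"]][click["category"]] += 1
    return {
        user: {
            "clicks": sum(categories.values()),
            "top_category": categories.most_common(1)[0][0],
        }
        for user, categories in counts.items()
    }


profiles = aggregate(parse(RAW_CLICKS))
```

In a real pipeline the resulting profiles would be written to the data warehouse or data lake instead of being kept in memory.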

Empowering Data Scientists: Data scientists gain access to the central repository to analyze customer behavior
and preferences. Predictive ML models are developed to predict customer preferences and recommend
products. In this case, the ML technique of collaborative filtering can be used, which bases recommendations
on similarity.
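One simple form of similarity-based recommendation is user-based collaborative filtering with cosine similarity. The sketch below is a toy illustration under that assumption, not necessarily the variant covered in the course; the `PURCHASES` data and the function names are hypothetical.

```python
from math import sqrt

# Hypothetical purchase counts: user -> {product: quantity}.
PURCHASES = {
    "alice": {"laptop": 1, "mouse": 2},
    "bob": {"laptop": 1, "mouse": 1, "keyboard": 1},
    "carol": {"novel": 3},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse purchase vectors."""
    shared = set(a) & set(b)
    dot = sum(a[p] * b[p] for p in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, purchases):
    """Score products the user has not bought by the similarity of their buyers."""
    own = purchases[user]
    scores = {}
    for other, basket in purchases.items():
        if other == user:
            continue
        similarity = cosine_similarity(own, basket)
        for product, quantity in basket.items():
            if product not in own:
                scores[product] = scores.get(product, 0.0) + similarity * quantity
    # Keep only products with a positive score, best first.
    return sorted((p for p, s in scores.items() if s > 0), key=lambda p: -scores[p])


recommendations = recommend("alice", PURCHASES)
```

Here alice and bob share two products, so bob's keyboard is recommended to alice. Carol has nothing in common with alice, so her purchases contribute a score of zero and are filtered out.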

The predictive model is then integrated into the pipeline to provide recommendations. For this, we need to
know how quickly recommendations must be delivered: immediately, nearly immediately, or only periodically
(daily, weekly, …)?

This depends on the type of data processing. The type is determined by the speed of processing, i.e., how
soon the data is processed after it becomes available (e.g., after the customer buys a product):

A. Immediately -> Real-time processing (immediate recommendation)
B. Nearly immediately -> Streaming (recommendation after a minute or so)
C. At regular intervals (daily/weekly) -> Batch processing (the recommendation only arrives at the end
of the week, e.g., as an e-mail in your inbox every Thursday).

2. What is the three-tier architecture? Describe the function and common technologies used in each layer.
Give an example of a three-tier architecture pipeline in e-commerce.

A three-tier architecture is a system architecture that divides a system or application into three logical and
physical layers, each with its own roles and responsibilities. In system design, it adheres to the
separation of concerns principle, which implies that each task should have a single change driver. This means
that the implementations of the presentation (UI), the application logic, and the data storage/retrieval can,
and therefore should, evolve independently. The architecture mitigates the ripple effects of implementation
changes in any one layer. Moreover, each tier can be developed simultaneously by a separate
development team: for instance, front-end developers for the presentation tier, back-end developers for
the logic tier, and database engineers for the data tier.
(Example: e-commerce/online web shop)

- Presentation tier: This is the top level, or User Interface. It translates the client's requests and the
returned results into a form the client understands. To fulfil the client's needs, the UI sends each
request to the Logic tier and receives the result back so that it can display it. In the example: a webpage
of the online web shop where the user can perform actions such as viewing a product, buying it,
registering, adding products to the basket, paying, etc.

- Business Logic tier: This is the second layer and is responsible for coordinating the application and
handling the requests it receives from the UI layer. It makes logical evaluations based on these requests
and executes operations, retrieving data from the Data tier by sending a request to that layer. It
moves and processes data between the UI and the Data tier. In the example: this layer handles the user's
actions and requests in the web shop by validating user information, calculating order totals, checking
product availability, and managing the order status.

- Data tier: This is the third layer, where data is stored and provided by a database or file system. After
receiving a request from the Logic tier, it sends back the necessary data so that the Logic tier can process
it, perform the required operations, and propagate the result back to the Presentation tier, which presents
it properly to the client. In the example: this layer stores the transactions made by users of the web
shop, customer data, product data, etc.
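The request flow between the three tiers can be sketched in a single Python file. This is only an illustration of the separation of responsibilities: in practice each tier usually runs as a separate process or server, and the class names, the `products` table, and the sample data are hypothetical.

```python
import sqlite3


class DataTier:
    """Stores the data and returns it on request (here: in-memory SQLite)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE products (name TEXT PRIMARY KEY, price REAL, stock INTEGER)"
        )
        self.conn.executemany(
            "INSERT INTO products VALUES (?, ?, ?)",
            [("mug", 8.0, 5), ("lamp", 20.0, 0)],
        )

    def get_product(self, name):
        query = "SELECT name, price, stock FROM products WHERE name = ?"
        return self.conn.execute(query, (name,)).fetchone()


class LogicTier:
    """Applies the business rules: availability check and order total."""

    def __init__(self, data_tier):
        self.data_tier = data_tier

    def order_total(self, name, quantity):
        product = self.data_tier.get_product(name)
        if product is None:
            raise ValueError(f"unknown product: {name}")
        _name, price, stock = product
        if quantity > stock:
            raise ValueError(f"only {stock} of {name} in stock")
        return price * quantity


class PresentationTier:
    """Turns user actions into requests and results into readable text."""

    def __init__(self, logic_tier):
        self.logic_tier = logic_tier

    def render_order(self, name, quantity):
        try:
            total = self.logic_tier.order_total(name, quantity)
        except ValueError as error:
            return f"Order failed: {error}"
        return f"{quantity} x {name}: EUR {total:.2f}"


shop = PresentationTier(LogicTier(DataTier()))
```

The presentation tier never touches the database directly, and the data tier contains no business rules. Swapping SQLite for another database would therefore only affect `DataTier`.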

Summary:

PT

Function: Interaction point with the end user. It receives requests, sends them to the LT, and presents
the results properly to the client.

Applied to Example: Webpage or website through which the user can perform certain actions/transactions (buy
a product, view a product, …). Facilitated by a web server and tools such as HTML, JavaScript, PHP, CSS, …

LT

Function: Receives requests from the PT, executes operations and logical evaluations based on those requests,
requests the necessary data objects from the DT, and sends the results back to the PT once the necessary
calculations, operations, and evaluations are finished.

Applied to Example: Handling the actions of users of the e-commerce website, such as providing product
availability, processing a form to request a quote, calculating a total or sub-total, adding a set of products
to the basket, ... Facilitated by an application server and tools such as Python / Java.

DT

Function: Data persistence. It is responsible for storing the data and providing it upon request of the LT.

Applied to Example: user data, product data, sales data, user clicks, ... Facilitated by a relational database
server/cloud storage and tools such as DBeaver / SQL.

3. Give three reasons why an organization would collect large datasets. Briefly discuss the strengths
(personalization, optimization of the supply chain, data-driven decision-making) and challenges (big
data, latency) of data-intensive applications. Give an example in e-commerce.

There are several reasons why enterprises collect large datasets:

1. Enhance Customer Engagement: Organizations can use large datasets to analyze customer behavior and
preferences online in real-time, enabling them to offer personalized recommendations and
experiences. This is done by an algorithm that analyzes the data in milliseconds.
