Academic Year: 2023 – 2024
University of Antwerp
SOLVED EXAM QUESTIONS
DATA ENGINEERING
prof. L. Feremans
THEORY
Introduction
1. What is a data pipeline? What are the different types of data processing and what is the role of the data
engineering in its development? Give an example of a data pipeline in e-commerce.
A data pipeline is a process in which raw data is extracted from various data sources (e.g., an inventory
management system, a Salesforce system, Google reviews, …), transformed into a usable format, and then loaded
into a centralized, structured data repository (e.g., a data warehouse or data lake). Such a pipeline gives data
scientists a foundation for turning usable data into valuable insights by analyzing the data and generating
value. The pipeline may itself contain machine learning models.
Data engineers must:
- Ensure that processing, and thus the pipeline, is:
Scalable to support large amounts of data.
Reliable and Available: with minimal downtime and operational robustness. This can be achieved
with multiple servers and an online copy to minimize downtime in case of issues.
Maintainable: it must support continuous changes.
- Implement components to manage the data pipeline:
ETL (Extract/Transform/Load): data is extracted from sources, transformed into a suitable format,
and then loaded into the repository (data warehouse/data lake).
ELT (Extract/Load/Transform): data is first extracted from sources and loaded into the data
warehouse and then transformed.
- Enable data scientists to perform analyses on the data to extract insights and value.
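The ETL steps above can be sketched as three small functions. This is a minimal, illustrative sketch; the source data, transformations, and the in-memory "warehouse" are all stand-ins for real systems:

```python
# Minimal ETL sketch: extract records from a (hypothetical) source,
# transform them into a uniform format, and load them into a target store.

def extract():
    # Stand-in for pulling rows from a source system (e.g., a REST API or CSV export).
    return [{"product": "shoes", "price": "59.99"}, {"product": "hat", "price": "19.50"}]

def transform(rows):
    # Parse price strings into floats and normalize product names.
    return [{"product": r["product"].upper(), "price": float(r["price"])} for r in rows]

def load(rows, warehouse):
    # Append the transformed rows to the central repository (here: an in-memory list).
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

In an ELT pipeline, the raw rows would be loaded first and the `transform` step would run inside the warehouse itself.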
Types of data processing:
1. Real-time processing: online processing where data is processed as soon as it arrives, suitable for
applications requiring immediate insights, e.g., financial trading platforms or online gaming, where
immediate data processing is crucial for real-time decision-making.
2. Streaming (near real-time processing): data is processed almost immediately after it is generated,
event-based, suitable for monitoring and alert systems. Ideal for environments like social media
monitoring or sensor data analysis in IoT (Internet of Things) devices, where data needs to be
processed almost instantly to trigger alerts or updates.
3. Batch processing (offline processing): data is collected over a period and processed in batches,
suitable for reporting (e.g., hourly or daily reports).
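Batch processing, the third type, can be illustrated with a small sketch that collects events over a period and aggregates them into a daily report (the event data is invented for illustration):

```python
from collections import defaultdict
from datetime import date

# Events collected over a period (the batch), then aggregated into a daily sales report.
events = [
    {"day": date(2024, 1, 1), "amount": 20.0},
    {"day": date(2024, 1, 1), "amount": 35.0},
    {"day": date(2024, 1, 2), "amount": 10.0},
]

daily_totals = defaultdict(float)
for e in events:
    daily_totals[e["day"]] += e["amount"]

print(dict(daily_totals))
```

In a streaming or real-time setting, by contrast, each event would be processed (and the totals updated) as it arrives instead of at the end of the period.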
Background information:
During the transformation phase of the data pipeline, the data engineer will be concerned with:
- Aggregating the data
- Parsing the data (from one format to the other)
Example in e-commerce: Personalized Product Recommendations
Sources of data: user clicks on the website, user-related information on the website, buying history from
transactional databases, customer reviews.
Data is then extracted from these data sources and transformed to build customer profiles.
This implies:
- Parsing the data
- Aggregating the data
Finally, the data is pushed and loaded into a centralized data repository, i.e., a data warehouse/data lake (e.g.,
Amazon Redshift or Google BigQuery). This is an ETL pipeline, but another way of processing the data is ELT,
where the last two steps are reversed.
Empowering Data Scientists: Data scientists gain access to the central repository to analyze customer behavior
and preferences. Predictive ML models are developed to predict customer preferences and recommend
products. In this case, ML technique: collaborative filtering can be used, basing the recommendation on
similarity.
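Collaborative filtering can be sketched with a toy user-item matrix: recommend to a user the items bought by the most similar other user. This is a minimal user-based variant with cosine similarity; the user names and purchase vectors are invented:

```python
import math

# Toy user-item purchase matrix: each row is a user, each column a product (1 = bought).
ratings = {
    "alice": [1, 0, 1, 1],
    "bob":   [1, 0, 1, 0],
    "carol": [0, 1, 0, 1],
}

def cosine(u, v):
    # Cosine similarity between two purchase vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Recommend to "bob" the items bought by his most similar user that he has not bought yet.
target = "bob"
best = max((u for u in ratings if u != target), key=lambda u: cosine(ratings[target], ratings[u]))
recs = [i for i, (mine, theirs) in enumerate(zip(ratings[target], ratings[best])) if theirs and not mine]
print(best, recs)  # most similar user and the recommended item indices
```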
The predictive model is then integrated into the pipeline to provide recommendations. For these, we must
decide how quickly they should be delivered: immediately, nearly immediately, or on a daily/weekly
schedule.
This depends on the type of data processing. The type is determined by the speed of processing, i.e., “to
what extent is the data processed as soon as it becomes available (e.g., the customer buys a product)?”
A. Immediately -> Real-time processing (immediate recommendation)
B. Nearly immediately -> Streaming (recommendation after a minute or so)
C. At regular intervals (daily/weekly) -> Batches (the recommendation only arrives at the end
of the week, or every Thursday you receive an e-mail in your inbox).
2. What is the three-tier architecture? Describe the function and common technologies used in each layer.
Give an example of a three-tier architecture pipeline in e-commerce.
A three-tier architecture is a system architecture that divides a system or application into three logical and
physical layers. Each layer has its own specific roles and responsibilities. In system design, it adheres to the
separation of concerns principle, which implies that each component should have a single change driver. That
means that changes to the presentation (UI), application logic, or data storage/retrieval can, and therefore
should, happen independently. This architecture mitigates the ripple effects of implementation changes in any
of the layers. Moreover, each tier can be designed simultaneously by a separate development team: for
instance, front-end developers for the presentation tier, back-end developers for the logic tier, and database
engineers for the data tier.
(Example: e-commerce/online web shop)
Presentation tier: This is the top level, or user interface (UI). It is responsible for translating the
client's requests into tasks and translating results into something the client understands. To fulfill the
client's needs, the UI sends the request to the Logic tier, which handles it and returns the result so the
UI can display it. Example: a webpage of the online web shop where the user can perform actions, such
as viewing a product, buying it, registering, adding products to the basket, paying, etc.
Business Logic tier: This is the second layer and is responsible for coordinating the application and
handling the requests it receives from the UI layer. It makes logical evaluations based on these tasks
and executes operations by retrieving data from the Data tier via requests to that layer. It moves and
processes data between the UI and Data tiers. In the web-shop example, this layer would handle the
user's actions and requests by validating user information, calculating order totals, checking product
availability, and managing the order status.
Data tier: This is the third layer, where data is stored and provided by a database or file system. After
receiving a request from the Logic tier, it sends back the necessary data so the Logic tier can process it,
perform the required operations to produce the results, and propagate them back to the
Presentation tier, which presents the result properly to the client. In the web-shop example, this layer
would store the data of the transactions made by users of the web shop, customer data, product
data, etc.
Summary:
PT
Function: Interaction points with the end-user. Receiving requests, sending requests to LT and properly
presenting the results to the client
Applied to Example: Webpage or website through which the user can pursue certain actions/transactions (buy
a product, view a product, …). Facilitated by a web server and tools such as HTML, JavaScript, PHP, CSS, …
LT
Function: Receive requests from the PT, execute operations and logical evaluations based on those requests,
request the necessary data objects from the DT, and send the results back to the PT after the necessary
calculations, operations, and evaluations are finished.
Applied to Example: Handling actions by the users of the e-commerce website, such as providing product
availability, filling in a form to request a quote, calculating a total or sub-total, adding a set of products to the
basket, ... Facilitated by an application server and tools such as Python / Java.
DT
Function: Data persistency. It is responsible for storing the data and providing the data upon request of the LT.
Applied to Example: user data, product data, sales data, user clicks, ... Facilitated by a relational database
server/cloud storage and tools such as DBeaver / SQL.
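The three tiers and their interaction can be sketched as three functions, one per tier. All names, products, and prices below are illustrative; in a real system the tiers would run on separate servers:

```python
# Data tier: stores data and provides it upon request (here: an in-memory dict
# standing in for a relational database).
PRODUCTS = {"sku-1": {"name": "Shoes", "stock": 3, "price": 59.99}}

def fetch_product(sku):
    return PRODUCTS.get(sku)

# Logic tier: evaluates the request (availability check, order-total calculation)
# using data requested from the Data tier.
def order_total(sku, quantity):
    product = fetch_product(sku)
    if product is None or product["stock"] < quantity:
        return None  # product unknown or not available in that quantity
    return round(product["price"] * quantity, 2)

# Presentation tier: translates the result into something the client understands.
def render_order(sku, quantity):
    total = order_total(sku, quantity)
    return "Out of stock" if total is None else f"Total: EUR {total}"

print(render_order("sku-1", 2))
```

Note how the Presentation tier never touches `PRODUCTS` directly: every data access goes through the Logic tier, so the storage implementation can change without rippling into the UI.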
3. Give three reasons why an organization would collect large datasets. Briefly discuss the strengths
(personalization, optimization of the supply chain, data-driven decision-making) and challenges (big
data, latency) of data-intensive applications. Give an example in e-commerce.
There are several reasons why enterprises collect large datasets:
1. Enhance Customer Engagement: Organizations can use large datasets to analyze customer behavior and
preferences online in real time, enabling them to offer personalized recommendations and
experiences. This is done by algorithms that analyze the data in milliseconds.