DE
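Marking guidelines:
Sequence numbers
1. Green: fairly well understood
1. Magenta: not yet clearly understood; needs more time
Whole headings
Memorized
Not yet memorized well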
Lec 1 intro
1. What is a data pipeline? What are the different types of data processing and what is the
role of data engineering in its development? Give an example of a data pipeline in
e-commerce.
• A data pipeline is a method in which raw data is extracted from various data sources,
transformed and then loaded to a data store (ETL), such as a data lake or data
warehouse, acting as a central repository where data is stored and made available for
analysis. Data pipelines can incorporate machine learning models to enhance data
processing and analysis, providing more advanced insights and predictive capabilities.
• The three types of data processing are:
○ Real-time processing = online processing
○ Streaming processing = near real-time
○ Offline/batch processing
Data engineering ensures that processing is:
○ Scalable := supports huge amounts of data
○ Reliable / Available := minimizes downtime and is operationally robust
○ Maintainable := supports continuous change
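To make the difference between batch and streaming processing concrete, here is a minimal Python sketch. The event records and the function names `batch_total` and `streaming_totals` are hypothetical, chosen only for illustration; a real system would read from files, message queues, or databases.

```python
def batch_total(events):
    """Offline/batch: the full dataset is available, so process it in one pass."""
    return sum(event["amount"] for event in events)


def streaming_totals(event_stream):
    """Streaming: consume events one at a time and emit an updated total after each."""
    running = 0.0
    for event in event_stream:
        running += event["amount"]
        yield running


# Hypothetical purchase events (an assumption, not data from the course).
EVENTS = [{"amount": 10.0}, {"amount": 5.0}, {"amount": 2.5}]
```

The batch function returns a single answer once all data has arrived, while the streaming generator yields an updated answer after every event, which is what keeps its results close to real time.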
• Example e-commerce pipeline:
- Data Sources:
The e-commerce site gathers data from various sources such as sales transactions,
user interactions, inventory systems, and customer feedback.
- Extraction:
The data engineer sets up connectors to pull data (user, transaction, view, clicks…)
from databases, APIs, and log files. For example, extracting transaction data from an
online store's database and user behavior data from web logs.
- Transformation:
Data is cleaned and transformed to ensure consistency. For instance, transforming
date formats, aggregating sales data by product category, and filtering out incomplete
records.
- Loading:
The transformed data is loaded into a data warehouse.
- Analysis and Reporting:
Data analysts and business intelligence tools access the data warehouse to generate
reports, dashboards, and insights, such as sales trends, customer behavior analysis,
and inventory forecasting.
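The extraction, transformation, and loading steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, their field names, and the list standing in for a data warehouse are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw transactions, standing in for rows pulled from the store's database.
RAW_TRANSACTIONS = [
    {"order_id": 1, "category": "books", "amount": "12.50", "date": "2024-01-05"},
    {"order_id": 2, "category": "books", "amount": "7.50", "date": "05/01/2024"},
    {"order_id": 3, "category": "toys", "amount": None, "date": "2024-01-06"},
    {"order_id": 4, "category": "toys", "amount": "20.00", "date": "2024-01-06"},
]


def extract():
    """Extraction: return raw records from the (simulated) source."""
    return list(RAW_TRANSACTIONS)


def normalize_date(text):
    """Accept either YYYY-MM-DD or DD/MM/YYYY and return an ISO 8601 date string."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def transform(records):
    """Transformation: drop incomplete records, unify dates, aggregate sales by category."""
    totals = defaultdict(float)
    for record in records:
        if record["amount"] is None:  # filter out incomplete records
            continue
        key = (normalize_date(record["date"]), record["category"])
        totals[key] += float(record["amount"])
    return [
        {"date": date, "category": category, "total": total}
        for (date, category), total in sorted(totals.items())
    ]


def load(rows, warehouse):
    """Loading: append the transformed rows to the (simulated) warehouse table."""
    warehouse.extend(rows)


def run_pipeline(warehouse):
    """Run extract, transform, and load in order."""
    load(transform(extract()), warehouse)
```

Calling `run_pipeline(warehouse)` with an empty list fills it with one aggregated row per date and category, which is the kind of table an analyst or dashboard would then query.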
2. What is the three-tier architecture? Describe the function and common technologies used
in each layer. Give an example of a three-tier architecture pipeline in e-commerce.
(Mainly product, user, and transaction information…)
Three-tier architecture is a well-established software application architecture that organizes
applications into three logical and physical computing tiers.
• Presentation Tier: This is the top-most layer of the application, often referred to as
the user interface (UI).
- Main function: present data to users and interpret the commands users give
through the interface.
- Tech: HTML, Java
- Example: In an e-commerce site, this layer would be the web pages where users
browse products, add items to their cart, and check out.
• Business logic / application tier: This tier sits in the middle. It manages the
application’s operations by processing commands, making logical decisions, and
performing calculations.
- Function: Coordinates the tiers above and below it: retrieves and processes data,
sends results back to the presentation tier, and passes data to the data tier for
storage.
- Tech: Python
- Example: Handles operations such as adding items to the shopping cart, processing
payments, and managing user authentication.
• Data tier: The lowest layer in this architecture. Information is stored in databases or
file systems and is accessed by the logic tier. This tier is responsible for maintaining
data integrity and security. It provides the logic tier with data so it can process and
then eventually return results to the user.
- Main function: data storage and retrieval; maintaining data integrity and security.
- Tech: SQL
- Example: In an e-commerce site, this layer would store product details, user
information, order history, and other transactional data.
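The separation of the three tiers can be sketched in Python as follows. This is a simplified illustration, not a production design: the class and function names, the sample products, and the dictionaries standing in for a SQL database are all assumptions.

```python
class DataTier:
    """Data tier: stores products and carts (dicts stand in for a SQL database)."""

    def __init__(self):
        self.products = {
            "p1": {"name": "Notebook", "price": 3.0},
            "p2": {"name": "Pen", "price": 1.5},
        }
        self.carts = {}

    def get_product(self, product_id):
        return self.products[product_id]

    def save_cart_item(self, user_id, product_id, quantity):
        cart = self.carts.setdefault(user_id, {})
        cart[product_id] = cart.get(product_id, 0) + quantity

    def get_cart(self, user_id):
        return dict(self.carts.get(user_id, {}))


class LogicTier:
    """Logic tier: validates commands and computes results using the data tier."""

    def __init__(self, data):
        self.data = data

    def add_to_cart(self, user_id, product_id, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.data.get_product(product_id)  # raises KeyError for unknown products
        self.data.save_cart_item(user_id, product_id, quantity)

    def cart_total(self, user_id):
        cart = self.data.get_cart(user_id)
        return sum(self.data.get_product(pid)["price"] * qty for pid, qty in cart.items())


def render_cart(logic, user_id):
    """Presentation tier: turn the logic tier's result into text shown to the user."""
    return f"Cart total for {user_id}: {logic.cart_total(user_id):.2f}"
```

The presentation function never touches the stored data directly; it only talks to the logic tier, which in turn is the only caller of the data tier. That is the property that lets each tier be replaced or scaled independently.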
3. Give three reasons why an organization would collect large datasets. Briefly discuss the
strengths (personalization, optimization of the supply chain, data-driven decision-making)
and challenges (big data, latency) of data-intensive applications. Give an example in
e-commerce.
• Why collect large datasets?
- Enhance customer engagement: Algorithms analyze data online and make
personalized recommendations.
> Data collected on the user side can answer questions like “Which products are
highly popular or trending? Which products are relevant for a specific
customer?” → good for personalization and customer engagement.
- Optimization of the supply chain and daily operational activities.
> Data on administration, stock management, shipping, payments, delivery… can
be leveraged to generate value for supply chain management and daily
operations.
- Support data-driven decision-making by management.
> With the help of dashboards, OLAP, and data mining approaches, managers
can make better decisions.
• Strengths: see the three reasons above.
• Challenges:
- How to store and manage data well? Billions of products, users, transactions, and
media items require strong data management capabilities.
- How to maintain rapid response times (low latency)?
• In an e-commerce case, the main strengths of collecting user data are: personalized
recommendations enhance user engagement and satisfaction; data on
administration, stock management, shipping, payments, delivery… can generate
insights and facilitate partnerships with supply chain partners; and with a dashboard at
hand, managers can answer questions like “How do the sales volumes of a specific
product compare across areas or over time?” and decide on promotions accordingly.
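The dashboard question about comparing sales volumes across areas is a grouped aggregation. A minimal sketch with Python's built-in `sqlite3` module follows; the `sales` table, its columns, and the sample rows are hypothetical and serve only to illustrate the query shape.

```python
import sqlite3


def build_sales_db():
    """Create an in-memory database with a small, made-up sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, region TEXT, quantity INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("laptop", "north", 10),
            ("laptop", "south", 7),
            ("laptop", "north", 5),
            ("phone", "north", 3),
        ],
    )
    conn.commit()
    return conn


def sales_by_region(conn, product):
    """Return total quantity sold per region for one product."""
    rows = conn.execute(
        "SELECT region, SUM(quantity) FROM sales "
        "WHERE product = ? GROUP BY region ORDER BY region",
        (product,),
    ).fetchall()
    return dict(rows)
```

Grouping by a time column instead of `region` would give the comparison over time. At the scale of billions of rows, the same query is where the storage and latency challenges listed above start to matter.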
4. What is a relational database? Define and explain the following terms and give an example
of each: Entity-relation diagram; 1-1, 1-n and n-n relations; Relational model; Online
transactional processing; SQL.
• A relational database is a type of database that stores and provides access to data
points that are related to one another. Data in a relational database is organized into
tables (also known as relations), which consist of rows and columns. Each table has a
unique key that identifies its records.
• Entity-Relation Diagram (ERD): An ERD is a visual representation of the entities within
a system and the relationships between those entities. Example: An ERD might show
entities such as Employee, Department, Project, and Location, with relationships
indicating how employees work on projects, departments are located at locations, etc.
• Relational model: The relational model is a theoretical framework for organizing data
into collections of tables (relations) with rows (tuples, or records) and columns
(attributes). It defines how data should be structured and how relationships between
data should be handled. Each table has a unique key, and tables link to one another
through these keys.
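As an illustration of how keys link tables, the sketch below uses Python's built-in `sqlite3` module to model 1-1, 1-n, and n-n relations. It reuses the Employee, Department, and Project entities from the ERD example; the `badge` and `assignment` tables, all column names, and the sample rows are assumptions made for this example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- 1-n: one department has many employees (foreign key on the "many" side).
CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER NOT NULL REFERENCES department(id)
);
-- 1-1: each employee has at most one badge (the foreign key is also the primary key).
CREATE TABLE badge (
    employee_id INTEGER PRIMARY KEY REFERENCES employee(id),
    code TEXT NOT NULL
);
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- n-n: employees and projects are linked through a junction table.
CREATE TABLE assignment (
    employee_id INTEGER NOT NULL REFERENCES employee(id),
    project_id INTEGER NOT NULL REFERENCES project(id),
    PRIMARY KEY (employee_id, project_id)
);
"""


def build_demo_db():
    """Create the schema in memory and insert a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO department VALUES (1, 'Engineering')")
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [(1, "Ana", 1), (2, "Ben", 1)])
    conn.executemany("INSERT INTO badge VALUES (?, ?)", [(1, "B-001"), (2, "B-002")])
    conn.executemany("INSERT INTO project VALUES (?, ?)", [(1, "Search"), (2, "Checkout")])
    conn.executemany("INSERT INTO assignment VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)])
    conn.commit()
    return conn


def projects_of(conn, employee_name):
    """Follow the n-n relation: list the projects an employee works on."""
    rows = conn.execute(
        "SELECT p.name FROM project p "
        "JOIN assignment a ON a.project_id = p.id "
        "JOIN employee e ON e.id = a.employee_id "
        "WHERE e.name = ? ORDER BY p.name",
        (employee_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

The `SELECT ... JOIN` statement in `projects_of` is also a small example of SQL, the query language used to define and retrieve data in relational databases.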
• OLTP: OLTP systems are designed to manage transaction-oriented applications. They
handle large numbers of short online transactions, ensuring data integrity in multi-
access environments. Example: An e-commerce system where customers browse
products, add items to the cart, and complete purchases. Each action (browsing,
adding to cart, purchasing) is a transaction processed by the OLTP system.
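A purchase in an OLTP system must either complete fully or leave no trace. The following sketch shows this with Python's built-in `sqlite3` module; the `inventory` and `orders` tables and the `purchase` function are hypothetical, and a real store would use a multi-user database server rather than an in-memory database.

```python
import sqlite3


def setup_store():
    """Create an in-memory store with one made-up product in stock."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (product TEXT PRIMARY KEY, stock INTEGER NOT NULL)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, quantity INTEGER)")
    conn.execute("INSERT INTO inventory VALUES ('pen', 5)")
    conn.commit()
    return conn


def purchase(conn, product, quantity):
    """Record an order and reduce stock as one transaction; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "INSERT INTO orders (product, quantity) VALUES (?, ?)",
                (product, quantity),
            )
            cursor = conn.execute(
                "UPDATE inventory SET stock = stock - ? WHERE product = ? AND stock >= ?",
                (quantity, product, quantity),
            )
            if cursor.rowcount != 1:
                # Not enough stock: raising here also undoes the order insert above.
                raise ValueError("insufficient stock")
        return True
    except ValueError:
        return False
```

When the stock check fails, the order row inserted a moment earlier is rolled back together with it, so the database never shows an order without a matching stock change. This is the data-integrity guarantee that OLTP systems provide for short transactions.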