EXAM QUESTIONS: DATA ENGINEERING
Table of Contents
WEEK 1: INTRODUCTION, FILE FORMATS, PYTHON FOR DATA ENGINEERING
1. What are the differences between a data engineer and a data scientist? What are common technical
and functional requirements or tasks fulfilled by a data engineer?
2. How are integers, decimal numbers, text and images stored in a computer? Give an example of binary
encoding for each type.
3. We saw three different data models for representing data. Name and provide a short summary of
each data model.
4. What are the strengths and weaknesses of the relational model versus the document-oriented model?
Which model would you prefer?
5. Which human-readable file formats are used for storing and communicating data? Provide short
examples in JSON and XML for storing the social network profile of a user.
6. Protocol Buffers and Apache Parquet are two binary formats used for storing or communicating data.
Explain how these formats work. What are the advantages of using a binary format over a text format,
such as CSV?
7. List, dictionary and set are the three most commonly used collections in Python. Explain the main
properties and operations of these data structures.
8. Suppose you want to store a student object, with a name, age and master, in Python. Which data
structure would you use?
WEEK 2: COMPUTER ARCHITECTURE, OS, NETWORKS AND REGULAR EXPRESSIONS
9. How do a Central Processing Unit and a Graphics Processing Unit work? Explain what an instruction,
SIMD instruction, cycle and clock frequency are. What is Moore’s law and what can we expect from
processors in the future?
10. The operating system has several responsibilities. Explain what the process manager does, and what
multi-tasking and the process queue are.
11. The operating system provides two techniques to run programs in parallel: processes and threads.
Explain both techniques. What are the advantages and disadvantages of each?
12. What is the internet protocol stack? Which steps occur when an email is transmitted from my iPhone
and received on a Windows desktop? Explain the difference between the application, transport, internet
and link layer.
13. What is an API? Give an example in the e-commerce domain.
14. What happens when you enter a URL in your browser? Follow-up question: explain DNS.
15. Explain the Hypertext Transfer Protocol. What does each part mean in the URL
https://www.sporza.be/sport/voetbal.html? How do you use HTTP to get the content of this webpage?
Which information is shared between each client and server using the HTTP headers?
16. Write a regular expression that matches all strings of format FIRSTNAME.LASTNAME@UNI.BE where
FIRSTNAME and LASTNAME can contain lower-case and upper-case alphabetic characters and UNI is
either uantwerpen, ugent, kuleuven or vub.
WEEK 3: CLOUD SERVICES AND LINUX
17. What are the advantages and disadvantages of using cloud services versus running services
on-premises? Discuss the different types of services provided by cloud vendors (i.e. the cloud stack) and give
an example of each.
18. What is the Hadoop File System? What happens when a disk that is part of a large cluster of disk
drives managed with HDFS crashes and is beyond repair?
19. Firewall (security group), encryption, public-private key pair, roles and responsibilities and the virtual
private cloud are all techniques used to secure cloud instances. Explain each technique.
20. A system administrator installs and manages multiple servers or infrastructure. Explain briefly which
tasks a system administrator must perform to set up a new software service and ensure smooth
operations. Give an example of a concrete installation of a website, database or login service.
21. What does the following Linux command do? Explain the following output and give examples of 5
other Linux commands.
22. What does cat students.csv | egrep '[a-zA-Z\.]+@uantwerpen\.be' do? Give examples of 5 other
Linux commands.
WEEK 4: ALGORITHMS AND DATA STRUCTURES
23. Give an example in Python of an algorithm with constant, linear and quadratic time complexity.
24. Given the following algorithm, how many additions are required for n=3 and n=5? What is the
worst-case time complexity? Why is this important?
25. Provide the pseudocode of the quicksort algorithm. Explain why the time complexity is O(n log(n)).
26. Assume you want to search for a user name in a large list containing 10^9 users. Which algorithm would
you use, and how much time would it take to finish the computation?
27. Assume you have a large collection of user names and you want to query if a name is present in this
collection. What is the complexity of using an ArrayList and a hash table? Which data structure would you
prefer?
28. Assume you have a large collection of user names and you want to query if a name is present in this
collection. What is the complexity of using an ArrayList and a binary tree? Which data structure would you
prefer?
29. What is an Abstract Data Type (ADT)? Give examples of an ADT and two implementations.
WEEK 5: RELATIONAL DATABASES
30. A key principle of databases is to store data non-redundantly and consistently. What does this mean?
31. A key principle of databases is to query data efficiently and allow concurrent access to the data.
What does this mean?
32. Transactions in a relational database have ACID properties: Atomic, Consistent, Isolated and Durable.
Give an example of why each property is important, for instance using the bank
transfer use case.
33. SQL is a declarative language, but this is not problematic since an RDBMS by default defines indexes on
primary and foreign keys and has a query optimizer. Explain how an index and a query optimizer work.
34. What is a primary key and what is a foreign key? What does referential integrity mean? What can
happen if you delete a row from a table in this context?
35. What are the advantages and disadvantages of normalisation versus de-normalisation?
36. Give an example of an SQL schema for storing students, courses and student grades.
37. The users table contains the student name, enrollment number and the master they enrolled in.
Write an SQL query that returns the total number of students for each master.
38. The employee table contains the employee id, first and last name; the department table contains
the name of the department and the employee id. Write an SQL query that returns the first
and last name for each employee in each department.
39. The suppliers table contains the name of each supplier and the city. The shipments table contains the
code of product, name of supplier and date of shipment. Write an SQL query that returns all products
shipped from Antwerp.
40. The suppliers table contains the name of each supplier and the city. The shipments table contains the
code of product, name of supplier and date of shipment. Write an SQL query that returns the total
number of products shipped from each city.
WEEK 6: DATA WAREHOUSING AND NOSQL
41. Give an example of a data cube with 3 dimensions.
42. Typical Online Analytical Processing (OLAP) operations are pivoting, slicing, drill-down and roll-up. Give an example of
each type of operation and explain.
43. The following table contains details on shipments. Give two examples of analytic queries where you
investigate the number of shipments using different dimensions.
44. What are the four main categories of NoSQL databases? Briefly explain what type of data each
category of database stores.
45. What is a key-value store? For which application is it used?
46. What are the advantages and disadvantages of sharding and replication? Explain why you would
combine both techniques.
WEEK 7: VISUALISATION
47. Explain perceptual accuracy and why it’s important. What type of visualisation would you prefer for a
plot where both the dependent and independent variables are quantitative? What type of visualisation would
you prefer when both the dependent and independent variables are categorical (or nominal)?
48. Assume you have a dataset with one target attribute and 10 continuous features. Name and explain
three techniques you would employ to visualise this dataset.
WEEK 8: PARALLEL AND DISTRIBUTED COMPUTING/MAPREDUCE
49. Give an example of both data parallelism and task parallelism.
50. How would you add a large set of numbers in parallel? Or perform an inner product on two large
vectors in parallel? Do you expect a speedup linear in the number of processors for these cases? Motivate
your answer.
51. What are the main properties of a distributed system? How is it different from parallel processing on
a single system? What happens when a node fails?
52. Explain the differences between services, batch processing and stream processing. Give an
example of each type of system.
53. Map-Reduce data processing tasks can be decomposed into multiple two-step stages, the first step
being Map and the second step being Reduce. Explain and give an example of the evaluation, or
distributed execution, of a program for counting words.
54. The following Spark code computes the total amount received for each credit account. Explain how
this program is evaluated and give an example of the computation for the example input file assuming
two nodes.
Follow-up question: explain the DAG and its disadvantages.
WEEK 9: RECOMMENDER SYSTEMS
55. The primary goal of a recommender system is to predict which item a user is most likely interested in.
What are three secondary goals? Explain.
56. What are the main components of a content-based recommender system?
57. How can you compute the similarity between two products? Give an example of content-based and
collaborative-filtering-based cosine similarity.
53. How does item-based collaborative filtering work?
58. Assume you are responsible for creating a website that includes recommendations. Which software
(e.g. micro-services, load balancer), hardware (e.g. servers) and data stores (e.g. JSON, relational database,
CSV) would you use? Provide a high-level diagram and the connections between the different components.
Written exam with oral dissemination