Big Data



Lecture 1: Introduction + Parallel Computing
- Parallel computing
- Examples of parallel processing
- Distributed computing
- Services, batch and stream processing

Lecture 2: Map-Reduce and the New Software Stack
- Problems suited for map-reduce

Lecture 3: Map-Reduce (continued) and Spark
- Relational algebra
- More Map-Reduce applications
- Spark

Lecture 4: Mining Data Streams (part 1)
- Sampling a fixed proportion
- Sampling a fixed-size sample
- Queries over a (long) sliding window

Lecture 5: Mining Data Streams (part 2)
- Filtering data streams
- Counting distinct elements
- Computing moments

Lecture 6: Frequent Itemset Mining for Big Data
- Finding frequent itemsets
- Apriori algorithm
- PCY (Park-Chen-Yu) algorithm
- Frequent itemsets in ≤2 passes

Lecture 7: Course Recap


Book at http://mmds.org/


Lecture 1: Introduction + Parallel Computing
What is big data?
There is no fixed definition; what counts as "big" changes over time: megabytes, gigabytes, terabytes, petabytes, exabytes, zettabytes, ...
In the past:
- Storage was expensive
- Only the most crucial data was preserved
- Most companies merely consulted historical data rather than analyzing it


Storing the data
Recent trends:
- Storage is cheap and easy
- Companies and governments preserve huge amounts of data
- There is a lot more data being generated
- Customer information, historical purchases, click logs, search histories, patient
histories, financial transactions, GPS trajectories, usage logs, images/audio/video,
sensor data,...
- More and more companies and governments rely on data analysis
- Recommender systems, next event prediction, fraud detection, predictive
maintenance, image recognition, COVID contact tracing, ...


Making data useful
Data analysis is computationally intensive and expensive.
For example:
- Online recommender systems: require instant results
- Frequent pattern mining: time complexity exponential in the number of different items,
independent of the number of transactions (e.g., market basket analysis)



- Multi-label classification: exponential number of possible combinations of labels to be
assigned to a new sample (e.g., Wikipedia tagging)
- Subspace clustering: exponential number of possible sets of dimensions in which
clusters could be found (e.g., customer segmentation)


So what is big data?
It depends on the use case.
Data becomes Big Data when it becomes too large or too complex to be analyzed with
traditional data analysis software:
- Analysis becomes too slow or too unreliable
- Systems become unresponsive
- Day-to-day business is impacted


Three aspects of big data
Volume: the actual quantity of data that is gathered
- Number of events logged, number of transactions (rows in the data), number of
attributes (columns) describing each event/transaction,...


Variety: the different types of data that are gathered (more types means more time-consuming processing)
- Some attributes may be numeric, others textual
- Structured vs unstructured data
- Irregular timing
- Sensor data may arrive at regular time intervals, while the accompanying log data arrive irregularly


Velocity: the speed at which new data is coming in and the speed at which data must be
handled
- May result in irrecoverable bottlenecks





What can we do about it?
Invest in hardware:
- Store more data
- Process the data faster
- Typically only (sub)linearly faster, which doesn't help much if an algorithm has exponential complexity


Design intelligent algorithms to speed up the analysis (our focus)
- Specifically make use of available hardware resources
- Provide good approximate results at a fraction of the cost/time
- Spend longer building a model that can then be applied on the fly




Parallel computing
Goal
Leveraging the full potential of your multicore, multiprocessor, or multicomputer system
- If you have to process large amounts of data it would be a shame not to use all n cores
of a CPU.
- If a single system does not suffice, how can you set up multiple computers so that they
work together to solve a problem? For instance, you can rent a cluster of 100 instances
using the cloud to do some computations that take 10 hours, but then what?


Reduce computation time
The goal of parallel processing is to reduce computation time.
- Algorithms are typically designed to solve a problem in a serial fashion. To fully leverage
the power of your multicore CPU you need to adapt your algorithm: split your problem
into smaller parts that can be executed in parallel





- We can't always expect to parallelize every part of an algorithm; in some cases, however, it is almost trivial to split the entire problem into smaller parts that can run in parallel, i.e. the problem is embarrassingly parallel
- In that case you can expect a linear speedup, i.e. executing two tasks in parallel on two cores should halve the running time


Parallel computation
Task parallelism: multiple tasks are applied to the same data in parallel.

Data parallelism: the same calculation is performed in parallel on many different data chunks.

Task & data parallelism: both forms can also be combined.
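Both forms fit in a few lines of Python using only the standard library; a minimal sketch, where mean and spread are hypothetical stand-in tasks:

    from concurrent.futures import ProcessPoolExecutor

    def mean(xs):    # task 1: average of the data
        return sum(xs) / len(xs)

    def spread(xs):  # task 2: range of the data
        return max(xs) - min(xs)

    if __name__ == "__main__":
        data = [1.0, 4.0, 2.0, 8.0]
        with ProcessPoolExecutor() as pool:
            # Task parallelism: different tasks applied to the same data.
            print(pool.submit(mean, data).result(),
                  pool.submit(spread, data).result())
            # Data parallelism: the same task applied to different chunks.
            print(list(pool.map(mean, [data[:2], data[2:]])))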





Analysis of parallel algorithms
- Parallelization can speed up the execution time of an algorithm, but it does not change
its complexity
- When analyzing parallel algorithms one is mostly interested in the speedup that is
gained
- Typically one has to take the overhead of the parallelization into account; the best speedup one can get is linear, i.e. if work is divided over 4 processors, execution becomes at most 4 times faster.
- A lot of operations involving large amounts of data are embarrassingly parallel. However,
in general there are data or procedural dependencies preventing a linear speedup.


Speedup
Speedup is the ratio between the time that will be required to run things serially and the actual
time that is required to run things in parallel.
Often you can’t realize a linear speedup because a part of the algorithm cannot be parallelized.


Assume:
T is the total serial execution time
p is the proportion of the code that can be parallelized
s is the number of parallel processes


Serial execution time: T_serial = p*T + (1-p)*T = T
Parallel execution time: T_parallel = p*(T/s) + (1-p)*T
Speedup realized: T_serial / T_parallel = 1 / ((1-p) + p/s)
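This is Amdahl's law. A quick numerical check, as a minimal Python sketch (the helper name amdahl_speedup is ours):

    def amdahl_speedup(p, s):
        # Speedup predicted by Amdahl's law when a fraction p of the
        # work parallelizes over s parallel processes.
        return 1.0 / ((1.0 - p) + p / s)

    print(amdahl_speedup(0.9, 4))     # ~3.08: 4 processes, 90% parallelizable
    print(amdahl_speedup(0.9, 1000))  # ~9.91: approaches 1/(1-p) = 10

Note that even with arbitrarily many processes the speedup is capped at 1/(1-p): the serial part dominates.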





Factors that prevent us from realizing a linear speedup, i.e. parts of the code that cannot be parallelized:
- Data dependencies
- Branches
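Two tiny hypothetical Python loops (the names prefix_sums and branchy are ours) illustrate both obstacles:

    def prefix_sums(xs):
        # Data dependency: each output needs the previous total,
        # so the iterations cannot simply be split across cores.
        out, total = [], 0
        for x in xs:
            total += x
            out.append(total)
        return out

    def branchy(xs):
        # Branch: which computation runs depends on the data itself,
        # so the work per element is uneven and hard to divide fairly.
        return [x * x if x > 0 else 0 for x in xs]

    print(prefix_sums([1, 2, 3, 4]))  # [1, 3, 6, 10]
    print(branchy([-1, 2, -3, 4]))    # [0, 4, 0, 16]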




Examples of parallel processing
Adding numbers in parallel
Additions that do not depend on each other can be performed in parallel.
In a serial fashion, we add 1+2 = 3, then 3+3 = 6, then 6+4 = 10, and so on: seven additions, one after the other, to sum eight numbers.
In a parallel fashion, we can group the numbers in pairs and be done in three "time units": 4 sums in the first round, then 2, then 1. We did not reduce the workload, there are still seven additions, but we reduced the time. The parallel approach is more costly, since we first "invest" in 4 cores working in parallel, then in another 2, and finally in 1, but that investment buys the shorter running time. A sketch of this tree-shaped reduction follows below.
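A minimal Python sketch of the tree-shaped reduction (pairwise_round is our own helper name; each pair within a round could run on its own core):

    def pairwise_round(values):
        # One parallel "time unit": add disjoint pairs; an odd
        # leftover element is carried over unchanged.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        return paired + ([values[-1]] if len(values) % 2 else [])

    values = [1, 2, 3, 4, 5, 6, 7, 8]
    rounds = 0
    while len(values) > 1:
        values = pairwise_round(values)  # 4 sums, then 2, then 1
        rounds += 1
    print(values[0], rounds)  # 36 in 3 rounds, still 7 additions in total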




7

Inner products
Some things are not parallelizable at all, and some can be parallelized only up to a point. For example, the inner product of two vectors is parallelizable up to the size of the vectors. With vectors of size 4, we can split the work over 2 cores (as in the example), but also over 4; beyond that, extra cores bring no further gain. A sketch follows below.
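A minimal sketch with Python's standard library, splitting the vectors into one contiguous chunk per worker (partial_dot and parallel_dot are our own names):

    from concurrent.futures import ProcessPoolExecutor

    def partial_dot(chunk):
        # Dot product of one pair of vector chunks.
        xs, ys = chunk
        return sum(x * y for x, y in zip(xs, ys))

    def parallel_dot(x, y, workers=2):
        step = (len(x) + workers - 1) // workers
        chunks = [(x[i:i + step], y[i:i + step]) for i in range(0, len(x), step)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            # Summing the partial results is the remaining serial part.
            return sum(pool.map(partial_dot, chunks))

    if __name__ == "__main__":
        print(parallel_dot([1, 2, 3, 4], [5, 6, 7, 8]))  # 70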




Data processing is usually embarrassingly parallel
Often the processing of big data can happen in parallel by simply applying your function to
separate chunks of the whole data set.
- Assuming there is no communication necessary between the workers



For instance:
- Define a function that takes as input the html source file of a web page and returns the
main article’s text
- Instead of applying the function sequentially to each web page, you can define several
workers that each apply the function to a large number of web pages simultaneously
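A minimal sketch of this pattern (extract_main_text is a hypothetical stand-in; a real version would actually parse the HTML):

    from multiprocessing import Pool

    def extract_main_text(html):
        # Stand-in for the article-extraction function described above.
        return html.strip()

    if __name__ == "__main__":
        pages = ["<html>page 1</html>", "<html>page 2</html>"]  # toy inputs
        with Pool(processes=4) as pool:
            # Workers apply the same function to different pages;
            # no communication between workers is needed.
            texts = pool.map(extract_main_text, pages)
        print(texts)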


Grid search
Another example where parallelization is handy is hyperparameter tuning: an algorithm has several hyperparameters that must be set before it can run, there are many possible combinations to choose from, and the goal is to find the combination that gives the best result. Running the same algorithm with different parameter settings in order to pick the optimal one parallelizes naturally, and doing it in parallel keeps the search from taking very long. We provide the same data to every instance and run the algorithm with a different setting on each machine; at the end we see which setting gives the best result. In this way we use both data and task parallelism (even though the algorithm is the same, each setting is a different task). A sketch follows below.
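A minimal sketch of a parallel grid search (evaluate is a hypothetical stand-in for "train with these settings and return a validation score"):

    from itertools import product
    from multiprocessing import Pool

    def evaluate(params):
        # Toy score; a real version would train and validate a model.
        depth, rate = params
        return -(depth - 5) ** 2 - (rate - 0.1) ** 2

    if __name__ == "__main__":
        grid = list(product([3, 5, 7], [0.01, 0.1, 1.0]))  # all combinations
        with Pool() as pool:
            scores = pool.map(evaluate, grid)  # one setting per worker
        best_score, best_params = max(zip(scores, grid))
        print(best_params, best_score)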




Distributed computing
Parallel computing is when a single machine with one shared memory drives multiple CPUs (or cores) and instructs them what to do.
Distributed computing is when we have multiple machines, each with its own memory and CPU, connected over a network. This is the more typical configuration when talking about big data.



- It relies on heterogeneous hardware
- To solve a common problem, multiple systems work together using messages over the
network to coordinate their actions
- For the user it looks like one single system (transparent), i.e. all the details of managing
the distributed system are hidden away
- Tolerant to failures of one system: if one worker node goes down, it should not impede
the distributed system from completing its job or providing its service. No single point of
failure (reliability)
- Scalable: it should be easy to add new nodes to increase computational capacity


There is a difference in how distributed computing systems are used. They can be used for
simple parallelism, but they can also be used for services (Google for example).
From a computational perspective: to solve a common problem, multiple systems work
together using messages over the network to coordinate their actions.
From a service perspective: multiple systems are available over the internet to deliver
services to a vast amount of users. In this sense it is used to spread the load, but also to
guarantee a certain level of availability (when is Google’s search engine ever down?)




