Resume

Big Data Summary

26 vues 4 fois vendu

Cours

Établissement
Tilburg University (UVT)

The document provides a summary for the course of Big Data, containing the seven lectures covered, more precisely: Lecture 1: Introduction + Parallel Computing Lecture 2: Map-Reduce and the New Software Stack Lecture 3: Map-Reduce (continued) and Spark Lecture 4: Mining Data Streams (part 1) ...

[Montrer plus]

Aperçu 10 sur 86 pages

Voir l'exemple

Publié le 30 octobre 2023
Nombre de pages 86
Écrit en 2023/2024
Type Resume

big data
data science

Établissement
Tilburg University (UVT)
Cours
Big Data
Cours Inconnu

3,99 €

Ajouté

Ajouter au panier

Ajouter au liste de veux

Garantie de satisfaction à 100%
Disponible immédiatement après paiement
En ligne et en PDF
Tu n'es attaché à rien

Big Data

Lecture 1: Introduction + Parallel Computing..................................................................... 2
Parallel computing.................................................................................................................... 4
Examples of parallel processing................................................................................................ 7
Distributed computing.............................................................................................................. 9
Services, batch and stream processing................................................................................... 11

Lecture 2: Map-Reduce and the New Software Stack.....................................................15
Problems suited for map-reduce............................................................................................ 26

Lecture 3: Map-Reduce (continued) and Spark................................................................ 29
Relational algebra................................................................................................................... 29
More Map Reduce Applications..............................................................................................33
Spark....................................................................................................................................... 38

Lecture 4: Mining Data Streams (part 1)............................................................................ 44
Sampling a fixed proportion................................................................................................... 46
Sampling a fixed-size sample.................................................................................................. 48
Queries over a (long) Sliding Window.....................................................................................49

Lecture 5: Mining Data Streams (part 2)............................................................................ 57
Filtering data streams............................................................................................................. 57
Counting distinct elements..................................................................................................... 61
Computing moments.............................................................................................................. 64

Lecture 6: Frequent Itemset Mining for Big Data.............................................................68
Finding frequent itemsets.......................................................................................................72
Apriori algorithm.....................................................................................................................75
PCY (Park-Chen-Yu) algorithm................................................................................................. 78
Frequent itemsets in ≤2 passes.............................................................................................. 80

Lecture 7: Course Recap......................................................................................................... 82

1

,Book at http://mmds.org/

Lecture 1: Introduction + Parallel Computing
What is big data?
There is no fixed definition, the concept changes over time: Megabytes, Gigabytes, Terabytes,
Petabytes, Exabytes, Zettabytes,…
In the past:
- Storage was expensive
- Only the most crucial data was preserved
- Most companies did no more than consult historical data, rather than analyze it

Storing the data
Recent trends:
- Storage is cheap and easy
- Companies and governments preserve huge amounts of data
- There is a lot more data being generated
- Customer information, historical purchases, click logs, search histories, patient
histories, financial transactions, GPS trajectories, usage logs, images/audio/video,
sensor data,...
- More and more companies and governments rely on data analysis
- Recommender systems, next event prediction, fraud detection, predictive
maintenance, image recognition, COVID contact tracing, ...

Making data useful
Data analysis is computationally intensive and expensive.
For example:
- Online recommender systems: require instant results
- Frequent pattern mining: time complexity exponential in the number of different items,
independent of the number of transactions (e.g., market basket analysis)

2

, - Multi-label classification: exponential number of possible combinations of labels to be
assigned to a new sample (e.g., Wikipedia tagging)
- Subspace clustering: exponential number of possible sets of dimensions in which
clusters could be found (e.g., customer segmentation)

So what is big data?
It depends on the use case.
Data becomes Big Data when it becomes too large or too complex to be analyzed with
traditional data analysis software:
- Analysis becomes too slow or too unreliable
- Systems become unresponsive
- Day-to-day business is impacted

Three aspects of big data
Volume: the actual quantity of data that is gathered
- Number of events logged, number of transactions (rows in the data), number of
attributes (columns) describing each event/transaction,...

Variety: the different types of data that is gathered (more types is more time consuming)
- Some attributes may be numeric, others textual
- Structured vs unstructured data
- Irregular timing
- Sensor data may come in regular time intervals, accompanying log data are
irregular

Velocity: the speed at which new data is coming in and the speed at which data must be
handled
- May result in irrecoverable bottlenecks

3

,What can we do about it?
Invest in hardware:
- Store more data
- Process the data faster
- Typically (sub)linearly faster - doesn’t help much if an algorithm has exponential
complexity

Design intelligent algorithms to speed up the analysis (our focus)
- Specifically make use of available hardware resources
- Provide good approximate results at the fraction of the cost/time
- Take longer to build a model that can then be used on-the-fly

Parallel computing
Goal
Leveraging the full potential of your multicore multiprocessor multicomputer system
- If you have to process large amounts of data it would be a shame not to use all n cores
of a CPU.
- If a single system does not suffice, how can you set up multiple computers so that they
work together to solve a problem? For instance, you can rent a cluster of 100 instances
using the cloud to do some computations that take 10 hours, but then what?

Reduce computation time
The goal of parallel processing is to reduce computation time.
- Algorithms are typically designed to solve a problem in a serial fashion. To fully leverage
the power of your multicore CPU you need to adapt your algorithm: split your problem
into smaller parts that can be executed in parallel

4

, - We can’t always expect to parallelize every part of the algorithm, however in some cases
it is almost trivial to split the entire problem in smaller parts that can run in parallel, i.e.
embarrassingly parallel
- In that case you can expect to have a linear speedup, i.e. executing two tasks in parallel
on two cores should halve the running time

Parallel computation
Task parallelism: multiple tasks are Data parallelism: a calculation is performed in
applied on the same data in parallel parallel on many different data chunks

Task & Data parallelism

5

,Analysis of parallel algorithms
- Parallelization can speed up the execution time of an algorithm, but it does not change
its complexity
- When analyzing parallel algorithms one is mostly interested in the speedup that is
gained
- Typically one has to take into account the overhead of the parallelization, but the
optimal speedup one can get is linear, i.e. if work is divided over 4 processors, then
execution time would be at most 4 times faster.
- A lot of operations involving large amounts of data are embarrassingly parallel. However,
in general there are data or procedural dependencies preventing a linear speedup.

Speedup
Speedup is the ratio between the time that will be required to run things serially and the actual
time that is required to run things in parallel.
Often you can’t realize a linear speedup because a part of the algorithm cannot be parallelized.

Assume:
T is the total serial execution time
p is the proportion of the code that can be parallelized
s is the number of parallel processes

Serial execution time: Tserial = p*T + (1-p)*T
𝑇
Parallel execution time: Tparallel = p* 𝑠 + (1-p)*T
𝑇𝑠𝑒𝑟𝑖𝑎𝑙 1
Speedup realized: 𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙
= 𝑝
(1−𝑝 + 𝑠
)

6

,Factors that prevent us from realizing a linear speedup, meaning we cannot parallelize the code:
Data dependencies

Branches

Examples of parallel processing
Adding numbers in parallel
Numbers that do not depend on each other can be parallelized.
In a serial fashion, we could add 1+2 = 3 and then 3+3 = 6 and then 6+4 = 10 and so on.
In a parallel fashion, we could group the numbers and be done in three “time units” (idles). We
did not reduce the workload, since there are seven sums to do, but we reduced the time.
However, this parallel system is costly because we “invested” first in 4 machines (cores working
parallel), then in other 2, and then in 1 final one. But investing in these machines is beneficial.

7

,Inner products
Some things are not parallelizable at all, and some can be parallelized up to a point. For example
an inner product of two vectors is parallelizable, up to the size of the vector. In this case the
vectors are of size 4. We can parallelize them with 2 cores (as in the example), but also in 4
cores.

Data processing is usually embarrassingly parallel
Often the processing of big data can happen in parallel by simply applying your function to
separate chunks of the whole data set.
- Assuming there is no communication necessary between the workers

8

,For instance:
- Define a function that takes as input the html source file of a web page and returns the
main article’s text
- Instead of applying the function sequentially to each web page, you can define several
workers that each apply the function to a large number of web pages simultaneously

Grid search
Another example where parallelization is handy relies on hyperparameters, where we have to
set different parameters in order for the algorithm to run. There are many combinations of
hyperparameters that we can choose, and the goal is to find which set of hyperparameters give
us the best result. It is interesting to do this in parallel because it will not take a long time.
Running the same algorithm with the different parameter settings in order to take the optimal
setting is something that can be parallelized. We take the data, we provide the data to each
instance, and we run the same algorithm in different settings on different machines. At the end
we can see which setting gives the best result. In this way, we can do both data and parallelism
(even if the algorithm is the same, the task is different).

Distributed computing
Parallel computing is when we have a single machine with a memory that works with different
CPUs and instructs them on what to do.
Distributed computing is when we have different machines and each machine has its own
memory and CPU that connects over a network. This is a more typical configuration when
talking about big data.

9

, Distributed computing
- It relies on heterogeneous hardware
- To solve a common problem, multiple systems work together using messages over the
network to coordinate their actions
- For the user it looks like one single system (transparent), i.e. all the details of managing
the distributed system are hidden away
- Tolerant to failures of one system: if one worker node goes down, it should not impede
the distributed system from completing its job or providing its service. No single point of
failure (reliability)
- Scalable: it should be easy to add new nodes to increase computational capacity

There is a difference in how distributed computing systems are used. They can be used for
simple parallelism, but they can also be used for services (Google for example).
From a computational perspective: to solve a common problem, multiple systems work
together using messages over the network to coordinate their actions.
From a service perspective: multiple systems are available over the internet to deliver
services to a vast amount of users. In this sense it is used to spread the load, but also to
guarantee a certain level of availability (when is Google’s search engine ever down?)

10

Les avantages d'acheter des résumés chez Stuvia:

Qualité garantie par les avis des clients

Les clients de Stuvia ont évalués plus de 700 000 résumés. C'est comme ça que vous savez que vous achetez les meilleurs documents.

L’achat facile et rapide

Vous pouvez payer rapidement avec iDeal, carte de crédit ou Stuvia-crédit pour les résumés. Il n'y a pas d'adhésion nécessaire.

Focus sur l’essentiel

Vos camarades écrivent eux-mêmes les notes d’étude, c’est pourquoi les documents sont toujours fiables et à jour. Cela garantit que vous arrivez rapidement au coeur du matériel.

Foire aux questions

Qu'est-ce que j'obtiens en achetant ce document ?

Vous obtenez un PDF, disponible immédiatement après votre achat. Le document acheté est accessible à tout moment, n'importe où et indéfiniment via votre profil.

Garantie de remboursement : comment ça marche ?

Notre garantie de satisfaction garantit que vous trouverez toujours un document d'étude qui vous convient. Vous remplissez un formulaire et notre équipe du service client s'occupe du reste.

Auprès de qui est-ce que j'achète ce résumé ?

Stuvia est une place de marché. Alors, vous n'achetez donc pas ce document chez nous, mais auprès du vendeur eleonora28. Stuvia facilite les paiements au vendeur.

Est-ce que j'aurai un abonnement?

Non, vous n'achetez ce résumé que pour 3,99 €. Vous n'êtes lié à rien après votre achat.

Peut-on faire confiance à Stuvia ?

4.6 étoiles sur Google & Trustpilot (+1000 avis)

67474 résumés ont été vendus ces 30 derniers jours

Fondée en 2010, la référence pour acheter des résumés depuis déjà 14 ans

Commencez à vendre!

Universités et collèges populaires

Livres populaires