Big Data Summary


Notes for the Big Data course. It contains the slides and explanation of them.

Preview: 4 of 101 pages

  • 27 December 2022
  • 101 pages
  • 2022/2023
  • Summary
LenkaZ
Big Data
Lesson 1

Introduction

What is Big Data?

● No fixed definition, the concept changes over time

● Megabytes, Gigabytes, Terabytes, Petabytes, Exabytes, Zettabytes, …

● In the past:

● Storage was expensive

● Only the most crucial data was preserved

● Most companies only consulted historical data rather than analysing it



Storing the Data

● Recent trends:

● Storage is (relatively) cheap and easy

● Companies and governments preserve huge amounts of data

● There is a lot more data being generated

● Customer information, historical purchases, click logs, search histories, patient
histories, financial transactions, GPS trajectories, usage logs,
images/audio/video, sensor data, …

● More and more companies and governments rely on data analysis

● Recommender systems, next event prediction, fraud detection, predictive
maintenance, image recognition, COVID contact tracing, …



Making Data Useful

● However:

● Data analysis is computationally intensive and expensive

● Examples

● Online recommender systems: require instant results

● Frequent pattern mining: time complexity exponential in the number of
different items, independent of the number of transactions (e.g., market basket
analysis)

● Multi-label classification: exponential number of possible combinations of labels
to be assigned to a new sample (e.g., Wikipedia tagging)

● Subspace clustering: exponential number of possible sets of dimensions in
which clusters could be found (e.g., customer segmentation)
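
To make the exponential growth in these examples concrete, here is a minimal Python sketch. The item names and the helpers `count_candidate_itemsets` and `enumerate_itemsets` are hypothetical illustrations, not part of the course material. The sketch only counts the search space; it is not an actual mining algorithm.

```python
from itertools import combinations


def count_candidate_itemsets(n_items: int) -> int:
    """Number of non-empty itemsets over n_items distinct items: 2^n - 1."""
    return 2 ** n_items - 1


def enumerate_itemsets(items):
    """Yield every non-empty subset of items (only feasible for tiny inputs)."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


# Hypothetical market basket with three distinct items.
items = ["bread", "milk", "eggs"]
all_itemsets = list(enumerate_itemsets(items))
print(len(all_itemsets), "candidate itemsets for", len(items), "items")

# The count depends only on the number of distinct items,
# not on the number of transactions.
for n in (10, 20, 40):
    print(n, "items ->", count_candidate_itemsets(n), "candidate itemsets")
```

The same kind of growth applies to the possible label combinations in multi-label classification and to the possible sets of dimensions in subspace clustering.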

So what is Big Data?

● Dependent on the use case

● Data becomes Big Data when it becomes too large or too complex to be analyzed with
traditional data analysis software

● Analysis becomes too slow or too unreliable

● Systems become unresponsive

● Day-to-day business is impacted



Three aspects of Big Data

● Volume

● The actual quantity of data that is gathered

● Number of events logged, number of transactions (rows in the data), number of
attributes (columns) describing each event/transaction, …

● Variety

● The different types of data that are gathered

● Some attributes may be numeric, others textual

● Structured vs. unstructured data

● Irregular timing

● Sensor data may arrive at regular time intervals, while the accompanying log data
is irregular

● The variety of the data increases the complexity of the analysis of the data

● Velocity

● The speed at which new data is coming in and the speed at which data must be handled

● May result in irrecoverable bottlenecks


What can we do about it?

● Invest in hardware

● Store more data

● Process the data faster

● Typically (sub)linearly faster – doesn’t help much if an algorithm has exponential
complexity

● Design intelligent algorithms to speed up the analysis

● Specifically make use of available hardware resources

● Provide good approximate results at a fraction of the cost/time

● Take longer to build a model that can then be used on-the-fly

● We focus on the latter
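
A small worked example of why faster hardware helps little against exponential complexity. This is a sketch under a simplifying assumption that is not from the slides: the running time is exactly proportional to 2^n for n items, and the time budget stays fixed. The function name `extra_items_affordable` is hypothetical.

```python
import math


def extra_items_affordable(hardware_speedup: float) -> float:
    """Extra items that fit in the same time budget when cost ~ 2^n.

    If the hardware is `hardware_speedup` times faster, we can afford
    2^(n + k) work instead of 2^n, where 2^k = hardware_speedup,
    so k = log2(hardware_speedup).
    """
    return math.log2(hardware_speedup)


for speedup in (2, 10, 1000):
    extra = extra_items_affordable(speedup)
    print(f"{speedup}x faster hardware -> about {extra:.1f} extra items")
```

Under this assumption, even hardware that is 1000 times faster buys fewer than 10 additional items, which is why the focus shifts to smarter algorithms.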



Parallel computing

Goal: leveraging the full potential of your multicore, multiprocessor, multicomputer system

● If you have to process large amounts of data it would be a shame not to use all n cores of a
CPU.

● If a single system does not suffice, how can you set up multiple computers so that they work
together to solve a problem? For instance, you can rent a cluster of 100 instances using the
cloud to do some computations that take 10 hours, but then what?

Hardware has a lot of potential that algorithms don’t always make use of.
Nowadays, even a single computer typically has multiple processors, and each processor has
multiple cores, so there are already ways to parallelize on a single machine.
Parallelization comes into play even more when you have multiple computers or cloud
computing.
If you have to process large amounts of data and you have multiple cores, multiple processors
and multiple computers at your disposal, then you should make the most of them and parallelize
your work as much as possible.

The goal of parallel processing is to reduce computation time.




● Algorithms are typically designed to solve a problem in a serial fashion. To fully leverage the
power of your multicore CPU you need to adapt your algorithm: split your problem into smaller
parts that can be executed in parallel

● We can’t always expect to parallelize every part of the algorithm. In some cases, however, it is
almost trivial to split the entire problem into smaller parts that can run in parallel; such
problems are called embarrassingly parallel

● In that case you can expect to have a linear speedup, i.e. executing two tasks in parallel on two
cores should halve the running time

Parallel processing can’t reduce the theoretical complexity of an algorithm, but it
can still cut down on run times.
How do we do this? We try to split the problem / the computation / the code into smaller parts
that can be executed independently in parallel. Key word: independently, because if parts of
your process depend on each other (one part has to wait for the output of the previous
one), then you can’t let them run in parallel. So it is not always possible to parallelize your
processes, but many processes are easily parallelized.
If you can fully parallelize the work, the ultimate goal is linear speedup, which means that
the speedup achieved is proportional to the number of processes that you
are running. E.g., if you are running 10 processes, then your total run time should be 10 times
shorter than the original.
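
The idea of splitting a problem into independent parts can be sketched in Python with the standard `concurrent.futures` module. This is a minimal illustration, not the course's own code: the prime-counting workload and the helpers `count_primes`, `split_range` and `count_primes_parallel` are hypothetical choices, picked only because each chunk can be processed without waiting for any other chunk.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def count_primes(bounds):
    """CPU-bound work on one independent chunk: count primes in [lo, hi)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        # n is prime if no d in 2..floor(sqrt(n)) divides it
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def split_range(lo, hi, parts):
    """Split [lo, hi) into at most `parts` contiguous, non-overlapping chunks."""
    step = -(-(hi - lo) // parts)  # ceiling division
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]


def count_primes_parallel(lo, hi, workers):
    """Process each chunk in a separate worker process and add up the results."""
    chunks = split_range(lo, hi, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes, chunks))


if __name__ == "__main__":
    workers = os.cpu_count() or 1
    print("workers:", workers)
    print("primes below 200000:", count_primes_parallel(0, 200_000, workers))
```

In practice the measured speedup will usually fall short of linear: starting processes and collecting results costs time, and here the chunks with larger numbers take longer than the others, so some workers finish early.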



Parallel computation
● Instruction level parallelism (pipelining, out-of-order execution) is completely transparent to
the user

● Task parallelism: multiple tasks are applied to the same data in parallel

● Data parallelism: a calculation is performed in parallel on many different data chunks




Two main types of parallelization:
- In task parallelism you run multiple tasks on the same data in parallel
- In data parallelism you split the data, so you run the same task on different
chunks (parts) of the data
Regardless of the type of parallelism, the goal is to shorten run times. In the
ideal world you will have linear speedup.
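
The difference between the two types can be sketched as follows. This is an illustrative example with made-up data and a hypothetical choice of tasks. It uses threads to keep the sketch short; for CPU-bound work in CPython a process pool would typically be needed to get a real speedup.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))  # made-up data set: the numbers 1..100

# Task parallelism: different tasks run on the same data.
tasks = {"mean": statistics.mean, "max": max, "min": min}
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
    task_results = {name: future.result() for name, future in futures.items()}

# Data parallelism: the same task (sum) runs on different chunks of the data.
chunk_size = 25
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(sum, chunks))
total = sum(partial_sums)  # combine the per-chunk results

print(task_results)
print(partial_sums, total)
```

Note that data parallelism needs a final combine step (here, adding the partial sums), while task parallelism simply collects one result per task.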
