Interactive Data Transforming | Lecture 5
Data Flow Model
The model is a way to visualize how data moves through an algorithm. It looks like a directed graph
where data flows between different operations or tasks. Construction goals:
Improve expressiveness and extensibility: you want to be able to create complex algorithms
easily and allow for changes or additions later.
Making coding easier: strive for high-level code
Enable additional optimizations
Increase performance by better utilizing the hardware (particularly RAM)
Representative example: Apache Spark, which is used for various tasks in data engineering, data
science, and machine learning.
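The directed-graph view described above can be illustrated with a small sketch. It is not how Spark
is implemented; the operation names (load, square, evens, total) and the dictionary layout are
assumptions made up for this illustration. Each operation lists the operations it reads from, and
the graph is evaluated in dependency order.

```python
from graphlib import TopologicalSorter

# Hypothetical data flow graph: name -> (function, names of the operations it reads from).
operations = {
    "load":   (lambda: [1, 2, 3, 4], []),
    "square": (lambda xs: [x * x for x in xs], ["load"]),
    "evens":  (lambda xs: [x for x in xs if x % 2 == 0], ["load"]),
    "total":  (lambda a, b: sum(a) + sum(b), ["square", "evens"]),
}

# For the sorter, every node maps to its predecessors (the operations feeding into it).
dependencies = {name: inputs for name, (_, inputs) in operations.items()}

results = {}
for name in TopologicalSorter(dependencies).static_order():
    func, inputs = operations[name]
    # Data "flows" along the edges: the outputs of the inputs become the arguments.
    results[name] = func(*(results[i] for i in inputs))

print(results["total"])  # 36
```

Because the algorithm is described as a graph rather than as a fixed sequence of statements, a
system is free to reorder or parallelize independent operations (here: square and evens).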
Spark
Spark builds on the ideas of MapReduce but is faster because it keeps data in memory (RAM) between
processing steps instead of writing intermediate results to a file system.
Lambda Expressions
Small anonymous functions (functions without a name). They can take any number of arguments but
consist of exactly one expression, whose value is returned.
Example:
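A minimal sketch of lambda expressions in Python. The names add, square and constant are
hypothetical and chosen only for illustration.

```python
# A lambda takes any number of arguments and returns the value of its single expression.
add = lambda a, b: a + b      # two arguments
square = lambda x: x ** 2     # one argument
constant = lambda: 42         # no arguments

print(add(2, 3))    # 5
print(square(4))    # 16
print(constant())   # 42
```

In practice lambdas are usually not assigned to names like this but passed directly to functions
such as map and filter.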
Map for iterables
Executes the function on each element of the iterable(s). Returns an iterator that contains the
results of applying the function. Example:
It squares each number in elem_list and puts the squared numbers in the new_elem list.
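The squaring example described above can be sketched as follows. The names elem_list and new_elem
come from the text; the sample values and the second example with two iterables (left, right,
sums) are assumptions.

```python
elem_list = [1, 2, 3, 4]  # sample values, assumed for illustration

# map returns an iterator, so it is wrapped in list() to get the values.
new_elem = list(map(lambda x: x ** 2, elem_list))
print(new_elem)  # [1, 4, 9, 16]

# With several iterables, the function receives one element from each.
left, right = [1, 2, 3], [10, 20, 30]
sums = list(map(lambda a, b: a + b, left, right))
print(sums)  # [11, 22, 33]
```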
Filter for iterables
The function should return a Boolean. Filter executes the function on each element of the iterable
and returns an iterator that contains the elements for which the function returned True. Example:
It keeps only the entries with a grade higher than 5. So, the output
would be: [{'name': 'John', 'exam': 9}, {'name': 'Anna', 'exam': 8}]
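A sketch that produces the output above. The input list is an assumption: the entries for John
and Anna follow from the output, while the third student (Piet, grade 4) and the variable names
grades and passed are hypothetical.

```python
grades = [
    {"name": "John", "exam": 9},
    {"name": "Piet", "exam": 4},  # hypothetical entry that gets filtered out
    {"name": "Anna", "exam": 8},
]

# The lambda returns a Boolean; filter keeps the elements for which it is True.
passed = list(filter(lambda student: student["exam"] > 5, grades))
print(passed)  # [{'name': 'John', 'exam': 9}, {'name': 'Anna', 'exam': 8}]
```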
Storage Layer
Requirements: the same as for the storage layer in Lecture 4 (scalability etc.), plus:
Fast RAM for hot data: recently used data is kept in RAM
Hadoop uses slow HDD (hard disk drive) storage, which can handle large amounts of data but is
slower to access. It's designed for large datasets that don't require immediate processing.
Apache Spark utilizes in-memory storage, which allows for much faster data processing compared to
HDD storage because it keeps data in RAM. It can also handle overflow by utilizing disk storage if the
data exceeds memory capacity.
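The idea of keeping data in RAM and overflowing to disk can be illustrated with a toy store. This
is a sketch under simplifying assumptions and not Spark's actual storage mechanism: the class
SpillingStore, its max_in_memory limit and the block names are all hypothetical.

```python
import os
import pickle
import tempfile


class SpillingStore:
    """Toy key-value store: RAM first, disk once RAM is full (hypothetical)."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory   # number of entries that fit "in RAM"
        self.memory = {}
        self.spill_dir = tempfile.mkdtemp()  # directory used for overflow

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key, value):
        if key in self.memory or len(self.memory) < self.max_in_memory:
            self.memory[key] = value         # fast path: keep the data in RAM
        else:
            with open(self._path(key), "wb") as f:
                pickle.dump(value, f)        # overflow: write the data to disk

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        with open(self._path(key), "rb") as f:
            return pickle.load(f)            # slower path: read back from disk


store = SpillingStore(max_in_memory=2)
for i in range(3):
    store.put(f"block{i}", list(range(i + 1)))

print(sorted(store.memory))  # ['block0', 'block1']
print(store.get("block2"))   # [0, 1, 2]
```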
RDDs (Resilient Distributed Datasets)
R: Resilient (recovers from failures)
D: Distributed (parts are placed on different computers)
D: Dataset (a collection of data: array, table, data frame, etc.)
An RDD is created by: (1) loading data from stable storage, e.g. from HDFS, or (2) manipulating
existing RDDs, i.e. creating new RDDs by transforming existing ones.
Core properties of RDDs
Distributed.
Immutable (i.e. read-only, cannot be changed). Changing means creating a new RDD.
Lazily evaluated: nothing is computed right away. Spark waits until the result is actually
needed before it does any work. It's like not cleaning your room until guests are about to
arrive.
Cacheable: keep in main memory whenever possible.
Replicated.
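Immutability, lazy evaluation and caching can be illustrated with a toy class in plain Python.
This is a sketch, not the Spark API: ToyRDD, noisy_square and the other names are hypothetical,
and distribution and replication are left out entirely.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, single machine only)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces the data
        self._use_cache = False
        self._cached = None

    def map(self, func):
        # Transformation: returns a NEW ToyRDD; nothing is computed yet.
        return ToyRDD(lambda: [func(x) for x in self.collect()])

    def filter(self, predicate):
        return ToyRDD(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        self._use_cache = True    # keep the result in memory after the first action
        return self

    def collect(self):
        # Action: only now is the chain of transformations executed.
        if self._use_cache and self._cached is not None:
            return self._cached
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data


calls = []

def noisy_square(x):
    calls.append(x)               # record every call to make the laziness visible
    return x * x

numbers = ToyRDD(lambda: [1, 2, 3, 4])
squares = numbers.map(noisy_square).cache()

print(calls)              # [] because no action has been called yet
print(squares.collect())  # [1, 4, 9, 16]
squares.collect()         # served from the cache, noisy_square is not called again
print(len(calls))         # 4
print(numbers.collect())  # [1, 2, 3, 4], the original is unchanged
```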
RDDs contain:
- Details about the data, e.g. the data location or the actual data.
- Lineage information (the history of how an RDD was created and the transformations it
underwent):
  - Dependencies on other RDDs. For example, if RDD2 was created from RDD1 using a
  function, RDD2 knows it depends on RDD1.
  - Functions/transformations for recreating a lost split of an RDD from a previous RDD. If
  part of the RDD is lost, Spark can use this lineage information to recreate it by going
  back through the transformations from the original RDD.
Examples: RDD2 = RDD1.function_something(..). RDD3 = RDD2.function_something_else(…)
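Recovery through lineage can be sketched with a toy class. It is not the Spark API: LineageRDD and
its attributes are hypothetical, only element-wise map transformations are supported, and the data
is computed eagerly to keep the focus on recovery.

```python
class LineageRDD:
    """Toy RDD that remembers its parent and the function used to derive it."""

    def __init__(self, partitions, parent=None, func=None):
        self.partitions = partitions  # list of splits; None marks a lost split
        self.parent = parent          # dependency: the RDD this one was created from
        self.func = func              # transformation that was applied to the parent

    def map(self, func):
        new_parts = [[func(x) for x in part] for part in self.partitions]
        return LineageRDD(new_parts, parent=self, func=func)

    def get_partition(self, i):
        if self.partitions[i] is None:
            if self.parent is None:
                raise RuntimeError("source split lost: reload it from stable storage")
            # Recreate only the lost split by replaying the transformation on the parent.
            source = self.parent.get_partition(i)
            self.partitions[i] = [self.func(x) for x in source]
        return self.partitions[i]


rdd1 = LineageRDD([[1, 2], [3, 4]])
rdd2 = rdd1.map(lambda x: x * 10)
rdd3 = rdd2.map(lambda x: x + 1)

rdd3.partitions[1] = None     # simulate losing one split of rdd3
rdd2.partitions[1] = None     # and the matching split of rdd2

print(rdd3.get_partition(1))  # [31, 41]
print(rdd2.partitions[1])     # [30, 40]
```

Only the lost split is recomputed; the intact split 0 of every RDD is never touched.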