Data storage vs. data analysis:
- Data storage: cheap and simple; data is easy to generate and gather.
- Data analysis: difficult and expensive.
Main issues in Big Data (the three V's):
- Velocity: (1) incoming speed (2) handling speed -> results in bottlenecks
- Volume: actual quantity of data -> finite storage
- Variety: (1) types (2) structured or unstructured (3) irregular intervals
Consequences: (1) analysis becomes too slow/unreliable (2) systems become unresponsive (3) day-to-day business is impacted
Solutions to Big Data:
- Invest in hardware: (1) store more data, linear increase (2) process more data, linear increase; does not solve time complexity O(n^x)
- Invest in software: (1) use all available resources, linear increase (2) approximate results (3) model, periodically updated
Parallel computing vs. distributed computing:
- Parallel computing: single-node architecture; shared memory, multiple CPU threads, single file system; machine learning, statistics (memory); "classical" data mining (memory + disk)
- Distributed computing: cluster architecture with nodes and switches; multiple CPUs, multiple memories, multiple file systems; communication over the network (computational perspective); heterogeneous hardware; tolerant and reliable (service perspective); scalable
Use cases:
- Services: online; single-task / request-response; require availability and low response time; examples: Google, recommender systems
- Batch processing: offline; large data, processed periodically; requires high throughput; examples: model training, hyper-parameter search
- Stream processing: near real-time; consumes inputs as they arrive; requires low delay; examples: monitoring, detection
Map-Reduce
- User provides: (1) Map (2) Reduce. System handles: (1) partitioning (2) scheduling (3) grouping (4) failures.
- Master node: (1) coordination: task status (idle, in-progress, completed) and data location (2) node failures: ping workers, reset/reschedule their tasks
- Worker node: (1) chunk servers (2) map tasks: on failure, both in-progress and completed tasks are reset, because map output is written to the local FS (3) reduce tasks: on failure, only in-progress tasks are reset, because reduce output is written to the DFS
- Dynamic load balancing handles: (1) heterogeneous hardware (2) different chunks (3) recovery from machine failures
- Refinements: (1) fine task granularity: # map tasks >> # machines, for (a) pipelined shuffling (b) recovery (2) back-up tasks
Cost measures:
- Communication cost (bytes): elapsed = max I/O of any map task + max I/O of any reduce task; total = total I/O of all processes (map input + 2x map output + reduce output)
- Computation cost (seconds): elapsed = max map + max reduce (wall-clock time); total = all map + all reduce (serial running time)
Map-Reduce vs. Spark:
- Map-Reduce: non-flexible. (1) Workflow: (a) Map (b) Reduce (c) redundant steps. (2) Data format: DFS, key-value pairs only. Slow: (1) needs disk I/O every time (2) redundancy.
- Spark: flexible (and easy to use). (1) Workflow: (a) transformations of RDDs (map, filter, union) (b) actions (reduce, count, collect) (c) lazy evaluation and optimal ordering, for more complicated directed acyclic graphs. (2) Data format: RDDs in memory (cacheable), transformable without I/O; RDDs can be persisted on each node for future computations (in memory or on disk). Fast: (1) most work done in memory (2) flexibility.
When to use which:
- Use Map-Reduce: simple use cases, single pass over the data; needs little memory: (1) does not interfere with other users (2) does not interfere with other programs; matured.
- Use Spark: more complex use cases, multiple passes over the data; needs a lot of memory: (1) know your memory requirements (2) single-user, single-task environment; not fully mature; quick processing times (streams); easier to program and more general.
1.1 Parallel computing
Three Vs of big data: Volume, Velocity (incoming and handling), Variety (type and interval)
Improve speed: (1) hardware, (2) software: (a) use all hardware, (b) approximate, (c) model
Parallel computing: Instruction level parallelism, Task parallelism or Data parallelism
Speed-up: Tserial = p*T + (1-p)*T = T; Tparallel = p*T/s + (1-p)*T; speed-up = Tserial/Tparallel = 1/(p/s + (1-p)), with p the parallelizable fraction and s the number of parallel workers
Speed-up preventions: data dependencies (e.g. y = c*x, where x must be computed first), branches (if x = ...)
Examples: adding the numbers 1,2,3 (1+2 in one step, then 3+3: 2 time steps); inner product of (1,2) and (3,4): compute 1*3 and 2*4 in parallel, then sum
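A minimal sketch of this speed-up formula in Python; the parallel fraction 0.9 and the worker counts are made-up illustration values:

    # Amdahl-style speed-up, following the formula above.
    def speedup(p, s):
        # p: parallelizable fraction of the work, s: number of parallel workers
        return 1.0 / (p / s + (1.0 - p))

    # Illustration values (assumed): 90% of the work is parallelizable.
    print(speedup(0.9, 8))     # ~4.7x, well below the ideal 8x
    print(speedup(0.9, 1000))  # ~9.9x: the serial 10% caps the speed-up at 10x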
1.3 Distributed Computing
Distributed computing: separate machines, network, heterogeneous hardware, reliable, scalable, switches
Map-reduce: Data parallelization across nodes
Use cases of distributed computing: (1) Services: online, single-task, request-response. Requires: availability and speed. Examples: Google or a recommender system, with a different thread for each request.
(2) Batch processing: offline, large data, periodic. Requires: high throughput. Examples: parameter search, updating the models of a recommender system.
(3) Stream processing: near real-time, not request-driven. Requires: low delay. Examples: monitoring.
2.1 Map-Reduce
Problems with distributed computing and the Map-Reduce solution: (1) Network bottlenecks: data parallelization, send tasks to the data. (2) Easy to write: the underlying machinery is hidden; the user only needs to supply the map and reduce functions. (3) Machines fail: distributed file system -> large static data, only reading and appending; chunk servers with redundancy, with a master node for coordination.
Map-Reduce: (1) Map: extract something you care about (2) Send and group by key (3) Reduce:
Aggregate or transform.
Dataflow: input <k,v> -> map -> intermediate <k',v'> -> group by key -> <k',<v',v',...>> -> reduce -> output <k',v''>
User: Map and Reduce, System: partitioning, scheduling, group by, failures, communication
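As a minimal sketch of this split, a word-count example in plain Python; the grouping step is simulated in memory here (in a real system the framework does it) and the toy documents are assumed:

    from collections import defaultdict

    # User code: map emits (word, 1) for every word in a document.
    def map_fn(doc):
        for word in doc.split():
            yield (word, 1)

    # User code: reduce sums the counts for one key.
    def reduce_fn(key, values):
        return (key, sum(values))

    # System part, simulated: group the intermediate pairs by key, then reduce.
    docs = ["the cat", "the dog"]  # toy input, assumed
    groups = defaultdict(list)
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)
    print([reduce_fn(k, vs) for k, vs in groups.items()])  # [('the', 2), ('cat', 1), ('dog', 1)]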
Dataflow: only input and output live on the DFS, intermediate results on the local FS. Tasks are scheduled close to the physical storage of their data.
Coordination: master node -> status of tasks (# map tasks > # machines), location of intermediate results on the local FS, pings workers to detect failures
Failures: Mapper: all map tasks of the node are restarted (set to idle), because their results are stored on the local FS and are lost; reducers need that input from elsewhere. Reducer: only in-progress tasks are set to idle; completed output is written to the DFS and thus preserved. Master failure: abort the job.
Map vs. reduce tasks: fine granularity = # map tasks >> # machines, for (1) dynamic load balancing and pipelined shuffling (2) recovery from failures. Also make # map tasks > # reduce tasks.
Refinements: (1) Back-up tasks: send the last in-progress tasks to multiple machines and take whichever finishes first; makes completion time shorter. (2) Combiners: pre-aggregate map output, saving network bandwidth and time. Only valid when the aggregate function works on chunks: the function has to be both commutative (a+b = b+a) and associative ((a+b)+c = a+(b+c)), since chunks have different sizes (see the sketch below).
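A small sketch of why the combiner needs these properties; the per-key value chunks are assumed toy data:

    # Two map-output chunks for the same key (toy data, assumed).
    chunk1, chunk2 = [1, 2, 3], [10]

    # Sum is commutative and associative: combining per chunk is safe.
    assert sum([sum(chunk1), sum(chunk2)]) == sum(chunk1 + chunk2)  # 16 == 16

    # Mean is not associative over chunks of different sizes: a mean combiner is wrong.
    mean = lambda xs: sum(xs) / len(xs)
    print(mean(chunk1 + chunk2))               # true mean: 4.0
    print(mean([mean(chunk1), mean(chunk2)]))  # combined per chunk: 6.0, wrong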
Cost measures: Communication cost (bytes) = shipping of data: total = I/O of all processes (input file size + 2x map output + output file size); elapsed = max I/O of any map task + max I/O of any reduce task. Computation cost (seconds) = sorting and reducing: total = serial running time of all processes; elapsed = wall-clock parallel time. Usually communication cost dominates, and we ignore computation cost.
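A worked example of the total communication cost with made-up sizes (10 GB map input, 10 GB map output, 1 GB reduce output; none of these numbers are from the notes):

    # Total communication cost = map input + 2 * map output + reduce output.
    map_input, map_output, reduce_output = 10, 10, 1    # GB, assumed values
    print(map_input + 2 * map_output + reduce_output)   # 31 GB shipped in total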
2.2 Map-Reduce applications
RA Selection: keep only tuples that satisfy C, delete rows from table
MAP: if (R,t) satisfies C, output (t,t) GB: (t,<t>) RED: (t,t) (GB and RED are redundant here)
RA Projection: removes columns from a table
MAP: (R,(p,d)) -> (p,1) GB:(p,<1,1>) RED: (p,p) RED removes duplicates
RA Union: for tables of same schema, output all tuples in one of the two relations
MAP: (A,t)/(B,t) -> (t,t) GB: (t,<t,t>) RED: (t,t)
RA Intersection: tables of same schema, output tuples in common
MAP: (A,t)/(B,t) -> (t,A)/(t,B) GB: (t,<A,B>), (t,<A>) RED: if value <A,B> output <t,t>
RA Difference: same schema, present in A but not in B
MAP: (A,t)/(B,t) -> (t,A)/(t,B) GB: (t,<A,B>), (t,<A>) RED: if value <A> output <t,t>
RA Natural Join: different schema, join on common items
MAP: (A,(z,b))/(B,(b,r)) -> (b,(A,z))/(b,(B,r)) GB: (b,(A,z),(B,r)) RED: if <(A,z)(B,r)> output (z,b,r)
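A minimal in-memory sketch of this natural-join pattern; the relation contents are assumed toy data and the grouping is simulated:

    from collections import defaultdict

    # Toy relations (assumed): A(z, b) and B(b, r), joined on the shared attribute b.
    A = [("z1", "b1"), ("z2", "b2")]
    B = [("b1", "r1"), ("b1", "r2")]

    groups = defaultdict(list)
    for z, b in A:  # MAP for A: key on b, tag with the relation name
        groups[b].append(("A", z))
    for b, r in B:  # MAP for B: key on b, tag with the relation name
        groups[b].append(("B", r))

    # REDUCE: for each b, combine every A-tuple with every B-tuple.
    joined = [(z, b, r)
              for b, vals in groups.items()
              for tag_a, z in vals if tag_a == "A"
              for tag_b, r in vals if tag_b == "B"]
    print(joined)  # [('z1', 'b1', 'r1'), ('z1', 'b1', 'r2')]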
RA Aggregate/Sum: produce a sum of some value for each a
MAP: (A,(a,b,c)) -> (a,c) GB: (a,<c,c>) RED: (a, sum(<c,c>)) (also can count,avg,min,max)
LA Matrix-Vector: matrix stored as coordinate triples (row, column, value): (0,0,a) (1,0,c) (0,1,b) (1,1,d), multiplied by vector (x,y); column-0 entries are multiplied by x, column-1 entries by y, the row index is the key
MAP: (0,a*x), (1,c*x) and (0,b*y), (1,d*y) GB: (0,<ax,by>),(1,<cx,dy>) RED: sum per row
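A sketch of the same matrix-vector pattern on concrete numbers; the matrix and vector values are assumed:

    from collections import defaultdict

    # Matrix as (row, col, value) triples and a small vector, toy data.
    triples = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]  # [[1, 2], [3, 4]]
    vector = [10, 20]

    groups = defaultdict(list)
    for row, col, val in triples:  # MAP: multiply each entry by its vector component
        groups[row].append(val * vector[col])

    result = {row: sum(parts) for row, parts in groups.items()}  # REDUCE: sum per row
    print(result)  # {0: 50, 1: 110}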
LA Matrix-Matrix (2 rounds): the first round already performs the multiplications, so products with 0s get eliminated; better for sparse matrices. Note: combine all j's from the two different schemas (A and B).
Round 1: MAP: (j,(A,i,aij)) and (j,(B,k,bjk)) GB: on j RED: ((i,k), aij*bjk). Round 2: GB on (i,k) RED: sum(values)
Example with A = [1 2; 3 4] (rows i, columns j) and B = [5 6; 7 8] (rows j, columns k): the j=1 group yields ((1,1),1*5), ((1,2),1*6), ((2,1),3*5), ((2,2),3*6); the j=2 group yields the analogous products of 2,4 with 7,8
LA Matrix-Matrix (1 round):
MAP: ((i,k),(A,j,aij)) for every k, and ((i,k),(B,j,bjk)) for every i, i.e. each entry is replicated once for every output cell it contributes to GB: on (i,k) RED: multiply the A- and B-values with matching j and sum
Example with the same A and B: key (1,1) receives (A,1,1), (A,2,2), (B,1,5), (B,2,7), and the reducer computes 1*5 + 2*7 = 19
Sorting: only possible under certain conditions: the range and distribution of the keys are known
MAP: x -> (int(x/100), x), e.g. 174 -> (1,174), 16 -> (0,16) GB: by hundreds RED: sort each group in parallel
Works only when the groups are significantly smaller than the total, e.g. sqrt(n) groups of sqrt(n) numbers. To make the groups equal in size we need to know the distribution. The group keys still have to be output in sorted order after the reduce step (see the sketch below).
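A small sketch of this range-partitioned sort; the numbers and the bucket width of 100 are assumed:

    from collections import defaultdict

    numbers = [174, 16, 305, 42, 199]  # toy input, assumed
    buckets = defaultdict(list)
    for x in numbers:  # MAP: bucket key = x // 100
        buckets[x // 100].append(x)

    # REDUCE: sort each bucket (in parallel in a real system), then emit the buckets in key order.
    print([x for key in sorted(buckets) for x in sorted(buckets[key])])  # [16, 42, 174, 199, 305]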
3.1 Spark
Map-Reduce is inflexible in its steps and needs to read from and write to disk every time via its DFS.
Spark is an extension of Map-Reduce, with the benefits of:
(1) Being more flexible. (a) You can change the order of steps or skip steps, while in Map-Reduce you always have to Map and Reduce even when one of the steps is redundant (e.g. selection). Spark only computes something when an action requires it (lazy evaluation) and can pick a more efficient order (e.g. first filter, then map), optimizing the steps in a directed acyclic graph of the dataflow. (b) You have more operations that simplify programming, divided into transformations (map, filter, union, intersection) and actions (reduce, count, collect).
(2) Spark does not need to read and write to disk every time: it uses Resilient Distributed Datasets (RDDs), which live in main memory and are passed along as in-memory objects. RDDs are also fault tolerant (lost partitions can be recomputed) and more flexible than key-value pairs only. You can also manually persist RDDs on each node, in memory or on disk.
Newer versions add DataFrames (named columns) and Datasets (type-safe), whereas RDDs operate with data at a lower level of abstraction.
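A minimal PySpark sketch of this transformation/action split; the input values and the filter threshold are made up, and running it requires a local Spark installation:

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-sketch")

    rdd = sc.parallelize([1, 2, 3, 4, 5])  # toy data, assumed
    squares = rdd.map(lambda x: x * x)     # transformation: nothing computed yet
    big = squares.filter(lambda x: x > 5)  # transformation: still lazy
    big.persist()                          # keep the RDD in memory for reuse

    print(big.collect())  # action: triggers the computation -> [9, 16, 25]
    print(big.count())    # action: reuses the persisted RDD -> 3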
Spark (+): faster than Map-Reduce by using memory; easier to use than Map-Reduce; more general than Map-Reduce
Spark (-): you need lots of memory to run it, which might interfere with other things you are running; less mature than Map-Reduce
Use Spark when: stream processing (quick); flexibility needed; iterative machine-learning tasks with multiple passes over the data
Use Map-Reduce (Hadoop DFS) when: simple use cases with a single pass over the data; multi-user environment; low on working memory
4.1 Data Streams
In streams we do not know the size of the data set in advance; we think of the data as infinite and non-stationary, so we cannot store it. Even if we could store it, algorithms would become too slow to process it in time. The input rate is controlled externally.
Applications include: What search queries are frequent? What are trending topics? Should an alarm
be raised? Record billing information.
We can distinguish several query problems in mining data streams, discussed below.
4.2/4.3 Sampling from a data stream
1. Problem: keeping a sample of fixed proportion from the entire stream (finite stream)
Naïve solution:
We generate a random number for each element that comes in and keep the element when the number falls in the chosen range. So if we wish to keep 1/10th of the stream, we generate a number from 0-9 and keep an element if it gets the number 0. But this approach is not accurate for some problems, like estimating the proportion of duplicates.
Suppose:
we have x + 2d queries of which d pairs are duplicates, so the true proportion of duplicates is d/(x+d).
When we use the naïve sampling method we obtain:
x/10 of the unique queries, and
2d/10 of the queries belonging to duplicate pairs, but the chance that a complete pair shows up is much smaller, as each query only has a 1/10 chance of being selected. Consider a pair of duplicates <a,a> and generate a random number for each of them; then:
selecting both a's has a chance of 1/10 * 1/10 = 1/100
selecting exactly one a has a chance of 1/10 * 9/10 + 9/10 * 1/10 = 18/100
selecting no a has a chance of 9/10 * 9/10 = 81/100
So the sample contains x/10 singletons, d/100 duplicate pairs and 18d/100 lone members of former pairs, and the estimated proportion of duplicates becomes (d/100) / (x/10 + d/100 + 18d/100) = d/(10x+19d), a large underestimation.
Solution: hashing users into buckets
We pick one tenth of the users for the sample by using a hash function that hashes all user ids (or another key) uniformly over 10 buckets, and take an element into the sample if its hash value falls in a chosen bucket (or buckets). A given key is then either entirely in or entirely out of the sample, so duplicate statistics are preserved (see the sketch below).
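A minimal sketch of this bucket-based sampling; the use of md5 and the 10-bucket split are assumptions for illustration:

    import hashlib

    def in_sample(user_id, buckets=10, keep=(0,)):
        # Hash the key uniformly into `buckets` buckets and keep the chosen one(s).
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        return h % buckets in keep

    # All queries of one user share the same fate, so duplicate statistics survive.
    stream = [("alice", "q1"), ("bob", "q2"), ("alice", "q1")]  # toy data, assumed
    print([(u, q) for u, q in stream if in_sample(u)])  # either both alice queries or neither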
2. Problem: keeping a sample of fixed size from the entire stream (infinite stream)
If the stream is long we want to maintain a sample of a fixed size. Because we do not know the length of the stream, we cannot know in advance with what probability to sample an element.
Solution: Reservoir Sampling
1. We store the first s elements of the stream; these form the sample S of size s that we want to maintain. We also keep the count n of elements seen so far.
2. When the next element arrives (n > s) we:
keep this element with probability s/(n+1), else discard it;
if we kept this element, it replaces one of the s elements in S uniformly at random.
The sample now has the property that it contains each element seen so far with probability s/n.
Say we want s = 2: we first store the first two elements of the stream, suppose a and b, so S = <a,b>. The next element, suppose c, is kept with probability 2/3 and, if kept, replaces a or b with equal probability. Each element now has a 2/3 chance of being in the sample.
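A compact sketch of reservoir sampling as described above; the stream range(1000) and s = 2 are assumed toy values:

    import random

    def reservoir_sample(stream, s):
        # Maintain a uniform sample of size s over a stream of unknown length.
        sample = []
        for n, x in enumerate(stream, start=1):   # n = number of elements seen so far
            if n <= s:
                sample.append(x)                  # store the first s elements
            elif random.random() < s / n:         # keep the n-th element with probability s/n
                sample[random.randrange(s)] = x   # it replaces a random element of the sample
        return sample

    print(reservoir_sample(range(1000), 2))  # e.g. [417, 88]: each element ends up in S with prob. 2/1000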
Proof:
We use the base case n = s, where every element is in S with probability 1, and look at the probability that an element i in S stays in S when element n+1 arrives:
P(i stays) = P(n+1 discarded) + P(n+1 kept and does not replace i)
= (1 - s/(n+1)) + (s/(n+1)) * ((s-1)/s) = n/(n+1)
So at each step an element already in S survives with probability n/(n+1), and after N elements the probability of being in S telescopes to (s/(s+1)) * ((s+1)/(s+2)) * ... * ((N-1)/N) = s/N.
4.4 Sliding Window: Counting elements
Suppose:
We want to have an estimate of the number of 1s in the last N bits received. The only way to get an exact answer is to store all N bits of the window and discard the oldest bit when a new bit comes in, to preserve the length of the window.
Using a uniformity assumption (stationary data):
We assume that the bits come in with a uniform distribution, meaning that the distribution does not change over time. We now only maintain two counters:
S: the number of 1s from the beginning of the stream
Z: the number of 0s from the beginning of the stream
and estimate the number of 1s in the last N elements as N * S / (S + Z).
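For instance (numbers assumed for illustration): with S = 600 ones and Z = 400 zeros seen so far and a window of N = 100, the estimate is 100 * 600 / (600 + 400) = 60 ones in the window.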
Solution: DGIM (non-stationary data)
This solution does not assume uniformity and gives an answer that is never more than 50% off.