Data Science & Society Summary Final Exam
Data Science Purposes:
Probing Reality, Pattern Discovery, Predicting Future Events, Understanding People and the World
CRISP-DM model phases: Business Understanding ↔ Data Understanding → Data Preparation ↔ Modeling → Evaluation (→ back to Business Understanding) → Deployment
Mapping of levels: Phases + Generic Tasks (CRISP Process Model) → Specialized Tasks + Process Instances (CRISP Process)
Failure indicators of dashboard design:
o Too Flat (no support to explore visually highlighted problems), Too Manual (information should be collected + delivered automatically), Too Isolated (no view of the whole system / tunnel vision)
o Three layers of dashboards: Monitoring, Analysing + Drill-Down
Hadoop:
Reasons for using Hadoop: moving computation to the data (schema-on-read style), scalability, reliability
MapReduce layer: JobTracker + TaskTracker (master node), TaskTracker (slave nodes)
HDFS layer: NameNode + DataNode (master node), DataNode (slave nodes)
HDFS takeaways: master-slave architecture, a cluster has a single NameNode, user data never flows through the NameNode
o Fault-tolerant (without RAID, using commodity hardware)
o Between NameNode + DataNodes: heartbeats, replication, balancing
MapReduce:
o Brings compute to the data, in contrast to traditional parallelism
o Stores replicated + distributed data in HDFS (in chunks, stored on several compute nodes)
o Ideal for operations on large flat datasets
o Mapper: transforms an input record into key-value pairs; multiple key-value pairs may be emitted per record (see the sketch after this list)
o Reducer: transforms all key-value pairs with a common key into a single key with a single value
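A minimal Python illustration of what mapper and reducer mean here; word count is my own example, not necessarily the lecture's:
    # Mapper: one input record -> zero or more (key, value) pairs
    def mapper(line):
        return [(word, 1) for word in line.split()]

    # Reducer: one key + all the values collected for it -> one (key, value) result
    def reducer(word, counts):
        return (word, sum(counts))

    print(mapper("to be or not to be"))  # [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
    print(reducer("be", [1, 1]))         # ('be', 2)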
YARN: enhances the power of the Hadoop cluster:
o Scalability with multi-tenancy, better cluster utilization, MapReduce compatibility + support for other workloads (graph processing, iterative modelling)
o Splits resource management + job scheduling into 2 separate units: one global ResourceManager + a per-application ApplicationMaster
Motivation for MapReduce:
o Ever-growing data; processing needs ever more processing power; accessing + transporting lots of data is expensive
o If the data needs updating, use an RDBMS; if you only need to skim through the data, take the computation to the data
MapReduce in Hadoop:
o The user defines a map function + Hadoop replicates the map to the data (key-value output)
o Hadoop shuffles + groups the key-value data; the user defines a reduce function + Hadoop distributes the groups to the reducers (see the end-to-end sketch after this list)
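A self-contained sketch of the whole flow (map → shuffle/group by key → reduce), simulated in plain Python; in real Hadoop the shuffle/group step and the distribution to reducers are done by the framework:
    from collections import defaultdict

    def mapper(line):                    # user-defined map function
        return [(word, 1) for word in line.split()]

    def reducer(word, counts):           # user-defined reduce function
        return (word, sum(counts))

    lines = ["to be or not to be", "be happy"]

    # "Map": Hadoop would run the mapper close to where each data chunk is stored
    mapped = [pair for line in lines for pair in mapper(line)]

    # "Shuffle + group": collect all values that share a key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # "Reduce": one reducer call per key
    print([reducer(k, v) for k, v in groups.items()])
    # e.g. [('to', 2), ('be', 3), ('or', 1), ('not', 1), ('happy', 1)]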
MR Design Considerations:
o Composite keys, extra info in values, cascade MapReduce jobs, aggregate map output when possible (see the combining sketch after this list)
o Limitations of MR: everything must fit the key-value model, MR data is not persistent, requires programming, not interactive
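A small sketch of "aggregate map output when possible" (in-mapper combining); an illustration, not code from the lecture:
    from collections import Counter

    def combining_mapper(lines):
        # Pre-aggregate counts locally before emitting, so fewer key-value
        # pairs have to be shuffled across the network
        counts = Counter()
        for line in lines:
            counts.update(line.split())
        return list(counts.items())

    print(combining_mapper(["to be or not to be", "be happy"]))
    # e.g. [('to', 2), ('be', 3), ('or', 1), ('not', 1), ('happy', 1)]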
NoSQL features:
o Horizontal scalability, replication over many servers, simple call-level interface, weaker concurrency model, efficient use of indexes, ability to dynamically add new data records
Arguments for SQL:
o Can do everything a NoSQL system can
o Majority market share with SQL
o Built to handle other application loads
o Common interface
Arguments for NoSQL:
o No benchmarks that show scaling is achieved
o Easier to understand
o Flexible schema
o Multi-node operations are not easy in SQL
o Need for partitioning capabilities
Population: a population is a collection of objects or items ("units") about which information is sought
Sample: a sample is a part of the population that is observed
Data preparation steps: obtaining the data → parsing the data → cleaning the data → building data structures
Population mean: an abstract concept (the true mean of the whole population), not directly observable
Average: not strictly defined (informal term)
Mean of a sample: sum of the values divided by the count, x̄ = (1/n) · Σ xi
Variance: spread of the data; the mean of the squared deviations from the mean
Standard deviation: the square root of the variance
For a small number of samples the standard deviation estimate is biased; solution: divide by n − 1 instead of n (Bessel's correction)
Sample median: robust against outliers; order the values by magnitude and take the middle of the ordered list
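A small sketch of these sample statistics in plain Python (the numbers are made up):
    import statistics

    data = [2.0, 3.5, 4.0, 4.5, 100.0]   # made-up sample with one outlier
    n = len(data)

    mean = sum(data) / n                                          # mean of a sample
    var_biased = sum((x - mean) ** 2 for x in data) / n           # divides by n (biased for small n)
    var_unbiased = sum((x - mean) ** 2 for x in data) / (n - 1)   # Bessel's correction
    std = var_unbiased ** 0.5                                     # std = square root of the variance
    median = statistics.median(data)                              # robust against the outlier

    print(mean, var_biased, var_unbiased, std, median)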
Anscombe's Quartet: the descriptive statistics can be the same, yet the plots can be very different.
Histogram: Shows frequency of values
PMF: normalization of the histogram, obtained by dividing each frequency by the number of samples
CDF: describes the probability that a real-valued random variable X is less than or equal to x, CDF(x) = P(X ≤ x)
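A sketch of how a PMF and an empirical CDF can be computed from a sample (made-up data):
    from collections import Counter

    sample = [1, 2, 2, 3, 3, 3, 5]
    n = len(sample)

    # PMF: relative frequency of each value (histogram divided by n)
    pmf = {value: count / n for value, count in Counter(sample).items()}

    # Empirical CDF at x: fraction of observations <= x
    def cdf(x):
        return sum(1 for v in sample if v <= x) / n

    print(pmf)      # {1: 0.142..., 2: 0.285..., 3: 0.428..., 5: 0.142...}
    print(cdf(3))   # 0.857... = P(X <= 3)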
Skewness: negative skew → longer tail on the left (bulk of the data on the right, mean < median); positive skew → longer tail on the right (bulk of the data on the left, mean > median); alternative measure: Pearson's median skewness coefficient
Exponential distribution: λ defines the shape of the distribution; mean = 1/λ, variance = 1/λ², median = ln(2)/λ
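A quick numerical check of these formulas by sampling (λ = 2 is an arbitrary choice):
    import math, random, statistics

    lam = 2.0
    sample = [random.expovariate(lam) for _ in range(100_000)]

    print(statistics.mean(sample), "vs", 1 / lam)              # mean ≈ 1/λ
    print(statistics.variance(sample), "vs", 1 / lam ** 2)     # variance ≈ 1/λ²
    print(statistics.median(sample), "vs", math.log(2) / lam)  # median ≈ ln(2)/λ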
Standard score (z-score): normalization of the data, z = (x − mean) / standard deviation
Covariance: measures whether two variables share the same tendency; the value of COV itself is hard to interpret
Pearson's correlation: covariance normalized with respect to the deviations of the two variables, ρ = Cov(X, Y) / (σX · σY)
Spearman's rank correlation: addresses the robustness problem when the data contains outliers; based on the differences between the ranks of the values in the two sets
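A sketch of covariance, Pearson correlation, and Spearman correlation (here computed as Pearson on the ranks, ignoring ties for simplicity); the data are made up:
    import statistics

    xs = [1.0, 2.0, 3.0, 4.0, 100.0]   # made-up data with an outlier
    ys = [2.1, 3.9, 6.2, 8.1, 9.0]

    def mean(v): return sum(v) / len(v)

    def cov(a, b):
        ma, mb = mean(a), mean(b)
        return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

    def pearson(a, b):
        return cov(a, b) / (statistics.pstdev(a) * statistics.pstdev(b))

    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    def spearman(a, b):
        return pearson(ranks(a), ranks(b))   # Pearson on the ranks

    # Spearman is 1.0 despite the outlier, Pearson is pulled down by it
    print(cov(xs, ys), pearson(xs, ys), spearman(xs, ys))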
Frequentist approach: assume there is a population that can be represented by several parameters; the parameters are fixed but not visible, and the way to estimate them is to take a sample
Bayesian approach: assume the data are fixed, not the result of a sampling process, but describing the data can be done probabilistically; Bayesian approaches focus on producing parameter distributions that represent all the knowledge that can be extracted
Problem: the estimate varies from one sample to another and will generally not be equal to the parameter of interest; quantify this by computing the standard error, i.e. the standard deviation of the sample mean, σx̄ = σ / √n
Computationally intensive alternative: bootstrapping: draw n observations with replacement from the sample and calculate the mean of this resample; repeat many times to approximate the sampling distribution.
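A minimal bootstrap sketch (the sample and the number of resamples are arbitrary choices):
    import random, statistics

    data = [2.0, 3.5, 4.0, 4.5, 6.0, 7.5]   # made-up sample
    B = 10_000

    # Each bootstrap resample: n draws with replacement, then take its mean
    boot_means = [statistics.mean(random.choices(data, k=len(data))) for _ in range(B)]

    # The std of the bootstrap means estimates the standard error of the mean
    print(statistics.stdev(boot_means))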
Confidence interval: a plausible range of values, where plausibility is defined from the sampling distribution, e.g. CI = [Θ − 1.96 × SE, Θ + 1.96 × SE], the z = 1.96 case of Θ ± z × SE
95% CI: in repeated sampling, 5% of such intervals do not contain the true mean
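A small sketch computing a 95% CI for the mean from a sample, using SE = s/√n and z = 1.96 (normal approximation, made-up data):
    import math, statistics

    data = [2.0, 3.5, 4.0, 4.5, 6.0, 7.5]                 # made-up sample
    mean = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))    # standard error of the mean

    ci = (mean - 1.96 * se, mean + 1.96 * se)             # 95% confidence interval
    print(mean, ci)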
P-value: the probability of observing data at least as favorable to the alternative hypothesis as the observed data, if the null hypothesis is true. In other words: given a sample and an apparent effect, what is the probability of seeing such an effect by chance?
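One way to estimate that probability by simulation is a permutation test (an illustrative sketch with made-up data, not necessarily the method used in the course); group labels are shuffled under the null hypothesis of no difference:
    import random, statistics

    group_a = [5.1, 6.2, 7.0, 6.8, 5.9]   # made-up data
    group_b = [4.0, 4.8, 5.2, 4.5, 5.0]

    observed = statistics.mean(group_a) - statistics.mean(group_b)

    pooled = group_a + group_b
    n_a = len(group_a)
    trials, count = 10_000, 0
    for _ in range(trials):
        random.shuffle(pooled)            # relabel at random (null hypothesis)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if diff >= observed:              # at least as extreme (one-sided)
            count += 1

    print(count / trials)                 # approximate p-value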
Supervised learning: algorithms that learn from labelled examples in order to generalize to the set of all possible inputs.
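A tiny illustration of learning from labelled examples: a 1-nearest-neighbour classifier in plain Python (my own toy example, not an algorithm from the lecture):
    # Labelled training examples: (feature vector, label)
    train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.2), "B"), ((4.8, 5.1), "B")]

    def predict(x):
        # Generalize to an unseen input: copy the label of the closest training example
        def dist(p, q): return sum((a - b) ** 2 for a, b in zip(p, q))
        return min(train, key=lambda ex: dist(ex[0], x))[1]

    print(predict((1.1, 0.9)))   # "A"
    print(predict((5.1, 5.0)))   # "B"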