Summary

Summary 2020 Data Science & Society Final Exam Preparation

Course
Data Science & Society (INFOMDSS)

Institution
Universiteit Utrecht (UU)

Book
Introduction to Data Science

Data Science & Society summary. For this summary I used the lecture materials from Period 1, 2020, and the books by Igual & Segui (2017) and Hutter et al. (2019).

Summarized whole book? No
Which chapters are summarized? 1-11
Uploaded on October 29, 2020
Number of pages 19
Written in 2020/2021
Type Summary

Book Title: Introduction to Data Science

Author(s): Laura Igual, Santi Segui

ISBN: 9783319500164
Edition: 1

Education
Master Business Informatics

Data Science & Society Summary Final Exam

Data Science Purposes:
Probing Reality, Pattern Discovery, Predicting Future Events, Understanding People and the World

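CRISP-DM model steps: 1-2(-1)-2-3-4(-3)-4-5(-1)-6, where 1 = Business Understanding, 2 = Data Understanding, 3 = Data Preparation, 4 = Modeling, 5 = Evaluation, 6 = Deployment, and a number in brackets marks a step back to an earlier phase
Mapping: Phases - Generic Tasks (CRISP Process Model) - Specialized Tasks - Process Instances (CRISP Process)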

Failure Indicators of Dashboard Design:
o Too Flat (no support to explore visually highlighted problems), Too Manual (information should be collected + delivered automatically), Too Isolated (no view of the whole system conf., or tunnel vision)
o Three layers of indicators for dashboards: Monitoring, Analysing, Drill Down
Hadoop:

Reasons for using Hadoop: moving computation to data (schema-on-read style), scalability, reliability
MapReduce Layer: JobTracker + TaskTracker (Master Node), TaskTracker (Slave Node)
HDFS Layer: NameNode + DataNode (Master Node), DataNode (Slave Node)
HDFS Takeaways: master-slave architecture, cluster has a single NameNode, user data never flows through the NameNode
o Fault-tolerant (without RAID, with commodity hardware)
o Between NameNode + DataNodes: heartbeats, replication, balancing
MapReduce:

o Brings compute to data, in contrast to traditional parallelism
o Stores replicated + distributed data in HDFS (in chunks, stored on several compute nodes)
o Ideal for operations on large flat datasets
o Mapper: transforms the input into key-value pairs; multiple key-value pairs may occur
o Reducer: transforms all key-value pairs with a common key into a single key with a single value
YARN: enhances the power of a Hadoop cluster:
o Scalability with multi-tenancy, cluster utilization + MapReduce compatibility + support for other workloads (graph processing, iterative modelling)
o Splits up resource management + job scheduling into 2 separate units -> one global ResourceManager + one ApplicationMaster per application
Motivation MapReduce:
o Ever-growing data, processing needs more processing power, access + transport of lots of data
o Data needs updating -> use an RDBMS; need to skim through data -> take computation to data
MapReduce in Hadoop:
o User defines a map function + Hadoop replicates the map function to the data (key-value output)
o Hadoop shuffles + groups the key-value data; user defines a reduce function + Hadoop distributes the groups to the reducers
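
The map, shuffle/group and reduce steps can be sketched in plain Python as a word count. This is only an in-memory illustration of the idea, not Hadoop code; the names map_fn, shuffle, reduce_fn and run_mapreduce are made up for this example.

```python
from collections import defaultdict

def map_fn(line):
    """User-defined map: emit one (word, 1) pair per word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Framework step between map and reduce: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """User-defined reduce: collapse all values of one key into one value."""
    return key, sum(values)

def run_mapreduce(lines):
    # In Hadoop the map calls would run on the nodes that hold the data chunks.
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in grouped.items())

counts = run_mapreduce(["big data big cluster", "data node"])
print(counts)
```

The user only writes map_fn and reduce_fn; everything in shuffle and run_mapreduce stands in for what the framework does.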
MR Design Considerations:
o Composite keys, extra info in values, cascade MapReduce jobs, aggregate map output when possible
o Limitations of MR: data must fit the key-value model, MR data is not persistent, requires programming, not interactive
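
Two of the design considerations, composite keys and aggregating map output, can be illustrated with a small in-memory sketch. The records and the function names below are hypothetical; in Hadoop the local aggregation would typically be done by a combiner.

```python
from collections import Counter

# Hypothetical input records: (year, city, sales)
records = [
    (2019, "Utrecht", 10),
    (2019, "Utrecht", 5),
    (2020, "Utrecht", 7),
    (2019, "Delft", 3),
]

def map_with_local_aggregation(chunk):
    """Map one chunk and pre-aggregate per composite key (year, city)
    before anything is sent on to the reducers."""
    partial = Counter()
    for year, city, sales in chunk:
        partial[(year, city)] += sales  # composite key = tuple of two fields
    return list(partial.items())

def reduce_sum(mapped_chunks):
    """Sum the partial results of all chunks per key."""
    total = Counter()
    for chunk_output in mapped_chunks:
        for key, value in chunk_output:
            total[key] += value
    return dict(total)

chunks = [records[:2], records[2:]]  # pretend each chunk lives on another node
mapped = [map_with_local_aggregation(chunk) for chunk in chunks]
totals = reduce_sum(mapped)
print(mapped[0], totals)
```

The first chunk emits one pair instead of two, which is the point of aggregating map output: less data has to be shuffled.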
NoSQL Features:
o Horizontal scalability, replication over many servers, simple call-level interface, weaker concurrency model, efficient use of indexes, ability to dynamically add new data records
Arguments for SQL:
o Can do everything a NoSQL system can
o Majority market share with SQL
o Built to handle other application loads
o Common interface

Arguments for NoSQL:
o No benchmarks that show scaling is achieved
o Easier to understand
o Flexible schema
o Multi-node operations are too easy in SQL
o Need for partial capabilities

Population: a collection of objects or items (“units”) about which information is sought
Sample: a part of the population that is observed
Data Preparation steps: Obtaining the data - Parsing the data - Cleaning the data - Building data structures
Population Mean: An abstract concept that does not elaborate further
Average: Not strictly defined

Mean of a Sample: sum of the values divided by the count
Variance: spread of the data (mean of the squared deviations from the mean)
Standard deviation: square root of the variance

For a small number of samples the std estimate is biased; solution: divide by n - 1 instead of n
Sample Median: robust against outliers; order the values by magnitude and take the middle of the ordered list
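
A minimal sketch of these summary statistics in plain Python, assuming a small made-up data list. Dividing by n - 1 is the usual small-sample correction (Bessel's correction).

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
n = len(data)

mean = sum(data) / n                                   # sum of values / count
squared_dev = [(x - mean) ** 2 for x in data]

var_biased = sum(squared_dev) / n                      # divides by n
std_biased = math.sqrt(var_biased)

var_corrected = sum(squared_dev) / (n - 1)             # divides by n - 1
std_corrected = math.sqrt(var_corrected)

median = statistics.median(data)                       # middle of ordered list

print(mean, std_biased, std_corrected, median)
```

The standard library offers the same quantities as statistics.pstdev (divide by n) and statistics.stdev (divide by n - 1).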
Anscombe's Quartet: the descriptive statistics can be the same while the plots are very different.
Histogram: Shows frequency of values
PMF: normalization of the histogram by dividing by the number of samples
CDF: describes the probability that a real-valued random variable X takes a value less than or equal to x
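
How histogram, PMF and CDF relate can be shown on a tiny made-up sample. The cdf function below is the empirical CDF of that sample, a sketch rather than a general-purpose implementation.

```python
from collections import Counter

values = [1, 2, 2, 3, 3, 3, 4]  # made-up sample
n = len(values)

hist = Counter(values)                           # value -> frequency
pmf = {v: count / n for v, count in hist.items()}  # histogram divided by n

def cdf(x):
    """Empirical CDF: fraction of sample values less than or equal to x."""
    return sum(1 for v in values if v <= x) / n

print(hist[3], pmf[3], cdf(3))
```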
Skewness: negative = skews left (longer tail on the left), positive = skews right (longer tail on the right); alternative: Pearson’s median skewness coefficient
Exponential distribution: λ defines the shape of the distribution; mean is 1/λ, variance is 1/λ^2, median is ln(2)/λ
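
The three formulas can be checked numerically. The sketch assumes the standard CDF of the exponential distribution, F(x) = 1 - e^(-λx), which is not stated in these notes, and a made-up rate λ = 0.5.

```python
import math

lam = 0.5  # hypothetical rate parameter λ

def exp_cdf(x, lam):
    """CDF of the exponential distribution: P(X <= x) = 1 - exp(-lam * x)."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

mean = 1 / lam
variance = 1 / lam ** 2
median = math.log(2) / lam

# By definition half of the probability mass lies below the median.
print(mean, variance, median, exp_cdf(median, lam))
```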

Standard Score: normalization of the data (subtract the mean, divide by the standard deviation)
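
A short sketch of the standard score, z = (x - mean) / std, on a made-up sample. The function name standard_scores is an assumption for this example, and the population standard deviation (divide by n) is used.

```python
import statistics

def standard_scores(values):
    """Return the z-score of every value: (x - mean) / standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

z = standard_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)
```

After the transformation the data has mean 0 and standard deviation 1, which makes variables with different units comparable.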

Covariance: measures whether two variables share the same tendency; the covariance value itself is hard to interpret

Pearson’s Correlation: normalization of the data with respect to their standard deviations
Spearman’s Rank Correlation: addresses the robustness problem when the data contains outliers; uses the ranks of the values (differences of ranks between the two sets) instead of the values themselves
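
The three measures can be compared on a made-up pair of lists where y contains one outlier. This is a minimal sketch: the helper names are assumptions, and the ranking helper assumes there are no ties.

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(x, y):
    """Mean of the products of the deviations from the two means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def pearson(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def ranks(v):
    """Rank of each value (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    result = [0] * len(v)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Pearson correlation computed on the ranks instead of the values."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 100]  # last value is an outlier
print(covariance(x, y), pearson(x, y), spearman(x, y))
```

The outlier pulls Pearson's correlation well below 1 although y always increases with x, while the rank-based Spearman correlation is unaffected.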
Frequentist Approach: assumes that there is a population that can be represented by several parameters; the parameters are fixed but not visible to the observer. The way to estimate them is to take a sample
Bayesian Approach: assumes that the data is fixed and not the result of a sampling process, but that the parameters describing the data can be described probabilistically. Bayesian approaches focus on producing parameter distributions that represent all the knowledge that can be extracted
Problem: an estimate varies from one sample to another and will not be exactly equal to the parameter of interest:

compute the standard error, i.e. the standard deviation of the sample mean: σ_x̄ = σ / sqrt(n)
Computationally intensive alternative: bootstrapping. Draw n observations with replacement, then calculate the mean of this resample (repeated many times)
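
A minimal bootstrap sketch, assuming a made-up sample: the resampling is repeated many times and the spread of the resampled means serves as the estimate of the standard error. The function name, the number of resamples and the fixed seed are choices made for this example.

```python
import random
import statistics

def bootstrap_se(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by bootstrapping."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(sum(resample) / n)
    # The standard deviation of the resampled means approximates the SE.
    return statistics.pstdev(means)

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
se_boot = bootstrap_se(sample)
se_formula = statistics.pstdev(sample) / len(sample) ** 0.5
print(se_boot, se_formula)
```

Both numbers should be close to each other; the bootstrap also works for statistics where no simple formula for the standard error is available.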
Confidence Interval: plausible range of values, with plausibility defined from the sampling distribution, e.g. CI = [Θ - 1.96 × SE, Θ + 1.96 × SE]; in general Θ ± z × SE

95% CI: in 95% of repeated samples the computed interval contains the true mean; in 5% it does not
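
What the 95% refers to can be made concrete with a small simulation: many samples are drawn from a population with a known mean, and we count how often the interval Θ ± 1.96 × SE contains that mean. This is a sketch under assumptions (normally distributed data, n = 50, 1000 trials, fixed seed), and the function names are made up.

```python
import math
import random
import statistics

def confidence_interval(sample, z=1.96):
    """Return (low, high) = mean -/+ z * SE, with SE = std / sqrt(n)."""
    n = len(sample)
    theta = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return theta - z * se, theta + z * se

def coverage(true_mean=0.0, n=50, trials=1000, seed=1):
    """Fraction of simulated samples whose CI contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        low, high = confidence_interval(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / trials

low, high = confidence_interval([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(low, high, coverage())
```

The coverage should come out near 0.95: the statement is about the procedure over repeated samples, not about one particular interval.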

P-Value: probability of observing data at least as favorable to the alternative hypothesis as the current data, if the null hypothesis is true. Meaning: given a sample and an apparent effect, what is the probability of seeing such an effect by chance?
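
One way to estimate "the probability of seeing such an effect by chance" is a permutation test. The notes do not name a method, so this is an assumed technique: the apparent effect is the difference in means between two made-up groups, and the null hypothesis is simulated by shuffling the group labels.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Fraction of label shufflings with a mean difference at least as
    large as the observed one (two-sided)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under the null hypothesis labels are arbitrary
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            count += 1
    return count / n_permutations

p_same = permutation_p_value([5, 6, 7, 8], [5, 6, 7, 8])
p_diff = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(p_same, p_diff)
```

A small p-value, as for the clearly separated second pair of groups, means the apparent effect would rarely arise by chance alone.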

Supervised Learning: algorithms that learn from labelled examples in order to generalize to the set of all possible inputs.
