Data Science & Society Summary Final Exam
Data Science Purposes:
Probing Reality, Pattern Discovery, Predicting Future Events, Understanding People and the World
CRISP-DM model phases: Business Understanding ↔ Data Understanding → Data Preparation ↔ Modeling → Evaluation (→ back to Business Understanding) → Deployment
Mapping of levels: Phases + Generic Tasks (CRISP Process Model) → Specialized Tasks + Process Instances (CRISP Process)
Failure indicators of dashboard design:
o Too Flat (no support to explore visually highlighted problems), Too Manual (information should be collected + delivered automatically), Too Isolated (no view of the whole system / tunnel vision)
o Three layers of dashboards: Monitoring, Analysing + Drill-Down
Hadoop:
Reasons for using Hadoop: moving computation to the data (schema-on-read style), scalability, reliability
MapReduce layer: JobTracker + TaskTracker (master node), TaskTracker (slave nodes)
HDFS layer: NameNode + DataNode (master node), DataNode (slave nodes)
HDFS takeaways: master-slave architecture, a cluster has a single NameNode, user data never flows through the NameNode
o Fault-tolerant (without RAID, using commodity hardware)
o Between NameNode + DataNodes: heartbeats, replication, balancing
MapReduce:
o Brings compute to the data, in contrast to traditional parallelism
o Stores replicated + distributed data in HDFS (in chunks, stored on several compute nodes)
o Ideal for operations on large flat datasets
o Mapper: transforms an input record into key-value pairs; multiple key-value pairs may be emitted per record (see the sketch after this list)
o Reducer: transforms all key-value pairs with a common key into a single key with a single value
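A minimal Python illustration of what mapper and reducer mean here; word count is my own example, not necessarily the lecture's:
    # Mapper: one input record -> zero or more (key, value) pairs
    def mapper(line):
        return [(word, 1) for word in line.split()]

    # Reducer: one key + all the values collected for it -> one (key, value) result
    def reducer(word, counts):
        return (word, sum(counts))

    print(mapper("to be or not to be"))  # [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
    print(reducer("be", [1, 1]))         # ('be', 2)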
YARN: enhances the power of the Hadoop cluster:
o Scalability with multi-tenancy, better cluster utilization, MapReduce compatibility + support for other workloads (graph processing, iterative modelling)
o Splits resource management + job scheduling into 2 separate units: one global ResourceManager + a per-application ApplicationMaster
Motivation for MapReduce:
o Ever-growing data; processing needs ever more processing power; accessing + transporting lots of data is expensive
o If the data needs updating, use an RDBMS; if you only need to skim through the data, take the computation to the data
MapReduce in Hadoop:
o The user defines a map function + Hadoop replicates the map to the data (key-value output)
o Hadoop shuffles + groups the key-value data; the user defines a reduce function + Hadoop distributes the groups to the reducers (see the end-to-end sketch after this list)
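A self-contained sketch of the whole flow (map → shuffle/group by key → reduce), simulated in plain Python; in real Hadoop the shuffle/group step and the distribution to reducers are done by the framework:
    from collections import defaultdict

    def mapper(line):                    # user-defined map function
        return [(word, 1) for word in line.split()]

    def reducer(word, counts):           # user-defined reduce function
        return (word, sum(counts))

    lines = ["to be or not to be", "be happy"]

    # "Map": Hadoop would run the mapper close to where each data chunk is stored
    mapped = [pair for line in lines for pair in mapper(line)]

    # "Shuffle + group": collect all values that share a key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # "Reduce": one reducer call per key
    print([reducer(k, v) for k, v in groups.items()])
    # e.g. [('to', 2), ('be', 3), ('or', 1), ('not', 1), ('happy', 1)]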
MR Design Considerations:
o Composite keys, extra info in values, cascade MapReduce jobs, aggregate map output when possible (see the combining sketch after this list)
o Limitations of MR: everything must fit the key-value model, MR data is not persistent, requires programming, not interactive
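A small sketch of "aggregate map output when possible" (in-mapper combining); an illustration, not code from the lecture:
    from collections import Counter

    def combining_mapper(lines):
        # Pre-aggregate counts locally before emitting, so fewer key-value
        # pairs have to be shuffled across the network
        counts = Counter()
        for line in lines:
            counts.update(line.split())
        return list(counts.items())

    print(combining_mapper(["to be or not to be", "be happy"]))
    # e.g. [('to', 2), ('be', 3), ('or', 1), ('not', 1), ('happy', 1)]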
NoSQL features:
o Horizontal scalability, replication over many servers, simple call-level interface, weaker concurrency model, efficient use of indexes, ability to dynamically add new data records
Arguments for SQL:
o Can do everything a NoSQL system can
o Majority market share with SQL
o Built to handle other application loads
o Common interface
Arguments for NoSQL:
o No benchmarks that show scaling is achieved
o Easier to understand
o Flexible schema
o Multi-node operations are not easy in SQL
o Need for partitioning capabilities
Population: a population is a collection of objects or items ("units") about which information is sought
Sample: a sample is a part of the population that is observed
Data preparation steps: obtaining the data → parsing the data → cleaning the data → building data structures
Population mean: an abstract concept (the true mean of the whole population), not directly observable
Average: not strictly defined (informal term)
Mean of a sample: sum of the values divided by the count, x̄ = (1/n) · Σ xi
Variance: spread of the data; the mean of the squared deviations from the mean
Standard deviation: the square root of the variance
For a small number of samples the standard deviation estimate is biased; solution: divide by n − 1 instead of n (Bessel's correction)
Sample median: robust against outliers; order the values by magnitude and take the middle of the ordered list
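A small sketch of these sample statistics in plain Python (the numbers are made up):
    import statistics

    data = [2.0, 3.5, 4.0, 4.5, 100.0]   # made-up sample with one outlier
    n = len(data)

    mean = sum(data) / n                                          # mean of a sample
    var_biased = sum((x - mean) ** 2 for x in data) / n           # divides by n (biased for small n)
    var_unbiased = sum((x - mean) ** 2 for x in data) / (n - 1)   # Bessel's correction
    std = var_unbiased ** 0.5                                     # std = square root of the variance
    median = statistics.median(data)                              # robust against the outlier

    print(mean, var_biased, var_unbiased, std, median)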
Anscombe's Quartet: the descriptive statistics can be the same, yet the plots can be very different.
Histogram: Shows frequency of values
PMF: normalization of the histogram, obtained by dividing each frequency by the number of samples
CDF: describes the probability that a real-valued random variable X is less than or equal to x, CDF(x) = P(X ≤ x)
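A sketch of how a PMF and an empirical CDF can be computed from a sample (made-up data):
    from collections import Counter

    sample = [1, 2, 2, 3, 3, 3, 5]
    n = len(sample)

    # PMF: relative frequency of each value (histogram divided by n)
    pmf = {value: count / n for value, count in Counter(sample).items()}

    # Empirical CDF at x: fraction of observations <= x
    def cdf(x):
        return sum(1 for v in sample if v <= x) / n

    print(pmf)      # {1: 0.142..., 2: 0.285..., 3: 0.428..., 5: 0.142...}
    print(cdf(3))   # 0.857... = P(X <= 3)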
Skewness: negative skew → longer tail on the left (bulk of the data on the right, mean < median); positive skew → longer tail on the right (bulk of the data on the left, mean > median); alternative measure: Pearson's median skewness coefficient
Exponential distribution: λ defines the shape of the distribution; mean = 1/λ, variance = 1/λ², median = ln(2)/λ
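A quick numerical check of these formulas by sampling (λ = 2 is an arbitrary choice):
    import math, random, statistics

    lam = 2.0
    sample = [random.expovariate(lam) for _ in range(100_000)]

    print(statistics.mean(sample), "vs", 1 / lam)              # mean ≈ 1/λ
    print(statistics.variance(sample), "vs", 1 / lam ** 2)     # variance ≈ 1/λ²
    print(statistics.median(sample), "vs", math.log(2) / lam)  # median ≈ ln(2)/λ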
Standard score (z-score): normalization of the data, z = (x − mean) / standard deviation
Covariance: measures whether two variables share the same tendency; the value of COV itself is hard to interpret
Pearson's correlation: covariance normalized with respect to the deviations of the two variables, ρ = Cov(X, Y) / (σX · σY)
Spearman's rank correlation: addresses the robustness problem when the data contains outliers; based on the differences between the ranks of the values in the two sets
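A sketch of covariance, Pearson correlation, and Spearman correlation (here computed as Pearson on the ranks, ignoring ties for simplicity); the data are made up:
    import statistics

    xs = [1.0, 2.0, 3.0, 4.0, 100.0]   # made-up data with an outlier
    ys = [2.1, 3.9, 6.2, 8.1, 9.0]

    def mean(v): return sum(v) / len(v)

    def cov(a, b):
        ma, mb = mean(a), mean(b)
        return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

    def pearson(a, b):
        return cov(a, b) / (statistics.pstdev(a) * statistics.pstdev(b))

    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    def spearman(a, b):
        return pearson(ranks(a), ranks(b))   # Pearson on the ranks

    # Spearman is 1.0 despite the outlier, Pearson is pulled down by it
    print(cov(xs, ys), pearson(xs, ys), spearman(xs, ys))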
Frequentist approach: assume there is a population that can be represented by several parameters; the parameters are fixed but not visible, and the way to estimate them is to take a sample
Bayesian approach: assume the data are fixed, not the result of a sampling process, but describing the data can be done probabilistically; Bayesian approaches focus on producing parameter distributions that represent all the knowledge that can be extracted
Problem: the estimate varies from one sample to another and will generally not be equal to the parameter of interest; quantify this by computing the standard error, i.e. the standard deviation of the sample mean, σx̄ = σ / √n
Computationally intensive alternative: bootstrapping: draw n observations with replacement from the sample and calculate the mean of this resample; repeat many times to approximate the sampling distribution.
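A minimal bootstrap sketch (the sample and the number of resamples are arbitrary choices):
    import random, statistics

    data = [2.0, 3.5, 4.0, 4.5, 6.0, 7.5]   # made-up sample
    B = 10_000

    # Each bootstrap resample: n draws with replacement, then take its mean
    boot_means = [statistics.mean(random.choices(data, k=len(data))) for _ in range(B)]

    # The std of the bootstrap means estimates the standard error of the mean
    print(statistics.stdev(boot_means))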
Confidence interval: a plausible range of values, where plausibility is defined from the sampling distribution, e.g. CI = [Θ − 1.96 × SE, Θ + 1.96 × SE], the z = 1.96 case of Θ ± z × SE
95% CI: in repeated sampling, 5% of such intervals do not contain the true mean
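A small sketch computing a 95% CI for the mean from a sample, using SE = s/√n and z = 1.96 (normal approximation, made-up data):
    import math, statistics

    data = [2.0, 3.5, 4.0, 4.5, 6.0, 7.5]                 # made-up sample
    mean = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))    # standard error of the mean

    ci = (mean - 1.96 * se, mean + 1.96 * se)             # 95% confidence interval
    print(mean, ci)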
P-value: the probability of observing data at least as favorable to the alternative hypothesis as the observed data, if the null hypothesis is true. In other words: given a sample and an apparent effect, what is the probability of seeing such an effect by chance?
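One way to estimate that probability by simulation is a permutation test (an illustrative sketch with made-up data, not necessarily the method used in the course); group labels are shuffled under the null hypothesis of no difference:
    import random, statistics

    group_a = [5.1, 6.2, 7.0, 6.8, 5.9]   # made-up data
    group_b = [4.0, 4.8, 5.2, 4.5, 5.0]

    observed = statistics.mean(group_a) - statistics.mean(group_b)

    pooled = group_a + group_b
    n_a = len(group_a)
    trials, count = 10_000, 0
    for _ in range(trials):
        random.shuffle(pooled)            # relabel at random (null hypothesis)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if diff >= observed:              # at least as extreme (one-sided)
            count += 1

    print(count / trials)                 # approximate p-value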
Supervised learning: algorithms that learn from labelled examples in order to generalize to the set of all possible inputs.
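A tiny illustration of learning from labelled examples: a 1-nearest-neighbour classifier in plain Python (my own toy example, not an algorithm from the lecture):
    # Labelled training examples: (feature vector, label)
    train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.2), "B"), ((4.8, 5.1), "B")]

    def predict(x):
        # Generalize to an unseen input: copy the label of the closest training example
        def dist(p, q): return sum((a - b) ** 2 for a, b in zip(p, q))
        return min(train, key=lambda ex: dist(ex[0], x))[1]

    print(predict((1.1, 0.9)))   # "A"
    print(predict((5.1, 5.0)))   # "B"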