100% tevredenheidsgarantie Direct beschikbaar na betaling Zowel online als in PDF Je zit nergens aan vast
logo-home
Summary 2020 Data Science & Society Final Exam Preparation €6,99
In winkelwagen

Samenvatting

Summary 2020 Data Science & Society Final Exam Preparation

 47 keer bekeken  2 keer verkocht

Data Science & Society summary. For this Summary I used the Materials of the Lecture from Period 1 2020 and the Book of Igual & Segui(2017) and Hutter et al (2019)

Voorbeeld 4 van de 19  pagina's

  • Nee
  • 1-11
  • 29 oktober 2020
  • 19
  • 2020/2021
  • Samenvatting
book image

Titel boek:

Auteur(s):

  • Uitgave:
  • ISBN:
  • Druk:
Alle documenten voor dit vak (2)
avatar-seller
crperling
Data Science & Society Summary Final Exam

Data Science Purposes:
 Probing Reality, Pattern Discovery, Predicting Future Events, Understanding People and the
world


Crisp-DM Model steps: 1-2(-1)-2-3-4(-3)-4-5-(-1)6
Mapping: Phases-Generic Tasks(Crisp PM) – Specialized Tasks-Process Instances(Crisp Pr.)

,Failure Indicator of Dashboard design:
o Too Flat(no sup t.exp.vis.highlighted probl.), Too Manual (automat. Collect + del. information),
Too Isolated(no view of whole system conf. or tunnel vision)
o Three Layers indicators for dashboards(Monitoring, Analysing + Drill Down)
Hadoop:




Reasons for using Hadoop: Moving Computat. To data (scheme on read style), scalability, reliability
MapReduceLayer: Jobtracker, TaskTracker(MasterNode),TaskTracker(SlaveNode)
HDFS Layer: NameNode, DataNode(MasterNode). DataNode(SlaveNode)
HDFS Takeaways: Master Slave Architecture, Cluster has single name node, User Data never flows
through NameNode
o Fault-tolerant (without RAID, with commodity Hardware)
o Between Name+DataNode (Heartbeats, Replication, Balancing)
MapReduce:

, o Brings compute to data in contrast to trad. Parallelism
o Store replicated + distributed data in HDFS (in chunks, stored on sev. Compute nodes)
o Ideal for operations on large flat datasets
o Mapper: transforms into key-value pairs multiple key, val. Pairs may occur.
o Reducer, Transforms every key,val pair with comm key into single key with a single value
YARN: enhances power of Hadoop Cluster:
o Scalability with multi tenancy, cluster utilization + MapReduce Compability + Support for other
workloads (Graphic Processing, Iterative Modelling)
o Splits up Resource Management + Job Scheduling into 2 sep. units.  one global resource
manager + per application master manager
Motivation MapReduce:
o Ever growing data, processing with more processing power, access + transport of lots of data
o Data need updating  use RDBMS, Need to skim through data  Take Computation to Data
MapReduce in Hadoop:
o User def. a map function + Hadoop replicates map to data(key,val output)
o Hadoop shuffles + groups key, val data, user defines reduce function + Hadoop distributes groups
to reducer.
MR Design Consideration:
o Composite Keys, Extra Info in Values, Cascade MapReduce Jobs, Aggregate Map Output when
possible
o Limitations for MR: Must fit key,val, MR data not persistne, Requires Programming, Not
Interactive
NOSQL Features:
o Horizontal scalability, replication over many servers, simple call level, weaker concurrency
model, efficient use of indizes, ab. To dynamic add new data records.
Arguments for SQL: Arguments for NOSQL:
o Can do everything a NOSQL system can o No benchmarks, that show scaling is ach
o Majority Market Share with SQL
o Built to handle other application loads o Easier to understand
o Common Interface o Flexible schmea
o To easy in SQL for Mulitnode operations
o Need for part. Capab.


Population: a population is a collection of objects, items (“units”) about which information is
Sought
Sample: a sample is a part of the population that is observed
Data Preparation steps: Obtaining the data -Parsing the data- Cleaning the data -Building data structures
Population Mean: An abstract concept that does not elaborate further
Average: Not strictly defined

, Mean of a Sample: Sum of Values divided by the Count
Variance: Spread of the data
Standard deviation: Square Root of mean/averaged by number of count

For Small number of sample std is biased: solution
Sample Median: Robust against outliers, Values ordered by magnitude, middle of ordered list
Ascombes Quadrangle: Descriptive statistics could be the same however, the plot can be very different.
Histogram: Shows frequency of values
PMF: Normalization of Histogram by dividing by number of samples
CDF: Descr. Prob. That real value random var X is less or equal of x.
Skewness: Negative – skews left, more datapoints left, Positive – skews right, more datapoints there,
alternative: Pearson’s median coefficient
Exponential distribution: λ defines the shape of the distribution, mean is 1/ λ, variance, 1/ λ^2, median ln(2)/ λ

Standard Score: Normalization of Data

Covariance: If two shared vars share the same tendency – COV itself hard to interpret

Pearson’s Correlation: Normalization of data in respect to their deviation:
Spearmans Rank Correlation: Adresess Robustness problem when data contains outliers. Differences of
values between sets.
Frequentist Approach: Assume that there’s a population that can be represented by sev. Parameters,
param. Are fixed but not vis to the population. Way to estimate is to take a sample
Bayesian Approach: Assume that data is fixed, but not the result of samling process, but describing data
can be done proababistically. Bays. Appr. Focus on prod parameter distr. That represent all the
knowledge that can be extracted
Problem faced when varying from one sample to another: will not be equal to the parameter of interest:

compute standard error or standard deviation of mean σx¯,:
Computational Intencise: Bootstraping: Drawing n obersv. With replacement. Then calculate mean of
this.
Confidence Interval: Plausible range of values, plausibility defined from sampling distribution ex: C I = [Θ -
1.96 × SE, Θ + 1.96 × SE] for Θ ± z × SE

95% CI: 5% of the interval does not contain the true mean

P-Value: Prob. Of obs. Data at least as favorable to the alternative hpythesis if the null hypothesis is true: Means: Given a
sample and an apparent effect, what is the prob of seeing such an effect by chance?

Supervised Learning: Alg. That learn from labelled example to gen. to set of all poss. Inputs.

Voordelen van het kopen van samenvattingen bij Stuvia op een rij:

Verzekerd van kwaliteit door reviews

Verzekerd van kwaliteit door reviews

Stuvia-klanten hebben meer dan 700.000 samenvattingen beoordeeld. Zo weet je zeker dat je de beste documenten koopt!

Snel en makkelijk kopen

Snel en makkelijk kopen

Je betaalt supersnel en eenmalig met iDeal, creditcard of Stuvia-tegoed voor de samenvatting. Zonder lidmaatschap.

Focus op de essentie

Focus op de essentie

Samenvattingen worden geschreven voor en door anderen. Daarom zijn de samenvattingen altijd betrouwbaar en actueel. Zo kom je snel tot de kern!

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Je krijgt een PDF, die direct beschikbaar is na je aankoop. Het gekochte document is altijd, overal en oneindig toegankelijk via je profiel.

Tevredenheidsgarantie: hoe werkt dat?

Onze tevredenheidsgarantie zorgt ervoor dat je altijd een studiedocument vindt dat goed bij je past. Je vult een formulier in en onze klantenservice regelt de rest.

Van wie koop ik deze samenvatting?

Stuvia is een marktplaats, je koop dit document dus niet van ons, maar van verkoper crperling. Stuvia faciliteert de betaling aan de verkoper.

Zit ik meteen vast aan een abonnement?

Nee, je koopt alleen deze samenvatting voor €6,99. Je zit daarna nergens aan vast.

Is Stuvia te vertrouwen?

4,6 sterren op Google & Trustpilot (+1000 reviews)

Afgelopen 30 dagen zijn er 53068 samenvattingen verkocht

Opgericht in 2010, al 14 jaar dé plek om samenvattingen te kopen

Start met verkopen
€6,99  2x  verkocht
  • (0)
In winkelwagen
Toegevoegd