Summary

Data Science Methods Summary

Name: Data Science Methods Summary
SKU: doc_509859
Rating: 3.40 (5 reviews)
Author: taylorvink

Uploaded on: 10-02-2019

Written in: 2017/2018

Summary for the course Data Science Methods, given for the Masters of Econometrics at Tilburg University in 2018. Summary of the book The Elements of Statistical Learning


Linked book

Trevor Hastie, Robert Tibshirani: Elements of Statistical Learning

Publication: Unknown
ISBN: 9780387848570
Edition: 2

Written for

Institution: Tilburg University (UVT)
Study: Econometrics and Operations Research
Course: Data Science Methods


Document information

Whole book summarized?: No
Which chapters are summarized?: Ch 1, 2, 4, 5, 6, 8, 10
Uploaded on: 10 February 2019
Number of pages: 54
Written in: 2017/2018
Type: Summary


Preview of the content

Data Science Methods

Week 1: Chapter 1 and Chapter 10
Supervised learning: involves building a statistical model for predicting, or estimating, an
output based on one or more inputs. It is clear which outcome you want to predict or which
treatment effect you want to quantify.
Unsupervised learning: there are inputs but no supervising output. Nevertheless, we can
learn relationships and structures from such data. You want to get a feel for the data and
reduce the dimension. There is no good way to assess the graphs because the goal is not
well-defined. If you define the goal more clearly, you can better assess results. Should be
viewed as a precursory step to supervised learning.

Clustering problem: find similarities by grouping individuals according to their observed
characteristics. Here we are not trying to predict an output variable. More on this in Ch 10.

Notation
• $n$ = number of distinct data points or observations in our sample, like $n = 3000$
people.
• $p$ = number of variables that are available for use in making predictions (like year,
wage, sex etc.)
• $x_{ij}$ = jth variable of the ith observation, where $i = 1, \dots, n$ and $j = 1, \dots, p$.
• $X$ = $n \times p$ matrix whose $(i,j)$th element is $x_{ij}$.
• $T$ or $'$ is notation for transpose.
• $y_i$ is the ith observation of the variable on which we wish to make predictions, like
wage.

Note that we can only multiply two matrices A and B to get AB if the number of columns in A
is equal to the number of rows in B.
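
The dimension rule for matrix products can be checked with a minimal NumPy sketch. The sizes, the matrix `B`, and the helper `can_multiply` are hypothetical and chosen only for illustration.

```python
import numpy as np

n, p = 4, 3                                       # hypothetical: 4 observations, 3 variables
X = np.arange(n * p, dtype=float).reshape(n, p)   # X is n x p
B = np.ones((p, 2))                               # B is p x 2: rows of B match columns of X

XB = X @ B       # defined: (n x p)(p x 2) gives an n x 2 matrix
XtX = X.T @ X    # defined: (p x n)(n x p) gives a p x p matrix


def can_multiply(A, B):
    """Return True when AB is defined, i.e. columns of A equal rows of B."""
    return A.shape[1] == B.shape[0]


print(XB.shape, XtX.shape)
print(can_multiply(X, X))   # X has 3 columns but 4 rows, so XX is not defined
```

Note that X'X is always defined, which is why it appears so often in the formulas that follow.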

Chapter 10: Unsupervised learning
Unsupervised learning: set of statistical tools intended for the setting in which we have
only a set of features $X_1, X_2, \dots, X_p$ measured on n observations. We are not interested in
prediction, because we do not have an associated response variable y. Rather, the goal is to
discover interesting things about the measurements $X_1, X_2, \dots, X_p$. There are two specific
types of unsupervised learning:
1. Principal Components Analysis: used for data visualization or data pre-processing
before supervised techniques are applied.
2. Clustering: method for discovering unknown subgroups in data.

Unsupervised learning is often performed as part of an exploratory data analysis. In this
case, we cannot check our work because we do not know the true answer.

Principal Component Analysis
When faced with a large set of correlated variables, principal components allow us to
summarize this set with a smaller number of representative variables that collectively
explain most of the variability of the original set. Principal Component Analysis (PCA) refers
to the process by which principal components are computed, and the subsequent use of
these components in understanding the data.

When p is large, examining two-dimensional scatter plots of the data is too cumbersome. A
better method to visualize the n observations is then required. PCA does this: it finds a
low-dimensional representation of a data set that contains as much as possible of the
variation. It is done if we have several features, and we believe there are only a few underlying
traits that are important for describing and analysing the data. PCA seeks a small number of
dimensions that are as interesting as possible, where the concept of interesting is measured
by the amount that the observations vary along each dimension. Each of the dimensions
found by PCA is a linear combination of the p features.

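How are principal components found? The first principal component of a set of features
$X_1, X_2, \dots, X_p$ is the normalized linear combination of the features that has the largest variance:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$

Normalized: $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.
$\phi_{11}, \dots, \phi_{p1}$: loadings of the first principal component, which together make the principal
component loading vector $\phi_1 = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})^T$. We constrain the loadings so that their sum
of squares is equal to one, since otherwise setting these elements to be arbitrarily large in
absolute value could result in an arbitrarily large variance.
The first principal component loading vector solves the optimization problem (with each
variable centred to have mean zero):

$$\max_{\phi_{11}, \dots, \phi_{p1}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1,$$

where, using $z_{i1} = \sum_{j=1}^{p} \phi_{j1} x_{ij}$, the objective can be written as $\frac{1}{n} \sum_{i=1}^{n} z_{i1}^2$.

We refer to $z_{11}, \dots, z_{n1}$ as the scores of the first principal component. The maximization
problem mentioned above can be solved via an eigen decomposition.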
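
As a rough illustration of the eigen decomposition route, the sketch below computes the first loading vector as the eigenvector of the sample covariance matrix with the largest eigenvalue. The data are simulated and hypothetical, and the 1/n covariance convention is an assumption chosen to match the 1/n factor in the maximization problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 200 observations of p = 3 correlated features.
n, p = 200, 3
latent = rng.normal(size=(n, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.3 * rng.normal(size=(n, p))

Xc = X - X.mean(axis=0)   # centre each variable to mean zero
S = Xc.T @ Xc / n         # p x p sample covariance matrix (1/n convention)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
phi1 = eigvecs[:, -1]     # loading vector: eigenvector of the largest eigenvalue
z1 = Xc @ phi1            # scores z_11, ..., z_n1 of the first principal component

print(np.sum(phi1 ** 2))               # sum of squared loadings, one up to rounding
print(np.mean(z1 ** 2), eigvals[-1])   # objective value equals the largest eigenvalue
```

The sign of `phi1` is arbitrary: flipping the sign of every loading gives the same direction and the same variance.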

Geometric interpretation of the first principal component:
The loading vector $\phi_1$ with elements $\phi_{11}, \phi_{21}, \dots, \phi_{p1}$ defines a direction in feature space along
which the data vary the most.

After the first principal component $Z_1$ of the features has been determined, we can find the
second principal component $Z_2$, which is the linear combination of $X_1, X_2, \dots, X_p$ that has
maximal variance out of all linear combinations that are uncorrelated with $Z_1$. The scores will
take the form:

$$z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \dots + \phi_{p2} x_{ip}$$

with $\phi_2$ being the second principal component loading vector.
It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the
direction $\phi_2$ to be orthogonal (= perpendicular) to the direction $\phi_1$. To find $\phi_2$ we solve a
similar maximization problem as before, but with the additional constraint that $\phi_2$ is
orthogonal to $\phi_1$.
Once we have computed the principal components, we can plot them against each other in
order to produce low-dimensional views of the data, e.g. $Z_1$ against $Z_2$, $Z_1$ against $Z_3$ etc.
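
The link between orthogonal loading vectors and uncorrelated scores can be checked numerically. This is a minimal sketch on simulated, hypothetical data; the mixing matrix `A` only serves to make the features correlated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n = 300 observations of p = 4 correlated features.
n, p = 300, 4
A = rng.normal(size=(p, p))
X = rng.normal(size=(n, p)) @ A
Xc = X - X.mean(axis=0)                  # centre each variable

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
order = np.argsort(eigvals)[::-1]        # sort components by decreasing variance
Phi = eigvecs[:, order]                  # column m-1 holds the loading vector phi_m
Z = Xc @ Phi                             # column m-1 holds the scores of Z_m

phi1, phi2 = Phi[:, 0], Phi[:, 1]
print(phi1 @ phi2)                           # close to 0: directions are orthogonal
print(np.corrcoef(Z[:, 0], Z[:, 1])[0, 1])   # close to 0: scores are uncorrelated
print(Z[:, 0].var(), Z[:, 1].var())          # variance of Z_1 is at least that of Z_2
```

Plotting `Z[:, 0]` against `Z[:, 1]` would give the two-dimensional view of the data described above.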

Principal component biplot: displays both the principal component scores and the
principal component loadings.
Blue state names: represent the scores of the first two principal components.
Orange arrows: indicate the first two principal component loading vectors.

The first loading vector places approximately equal weight on Assault, Murder and Rape (their
orange arrows lie close together horizontally) and much less weight on UrbanPop. Therefore, this
component roughly corresponds to a measure of overall rates of serious crimes. The second loading
vector places most of its weight on UrbanPop, therefore roughly corresponding to the level of
urbanization of the state. Overall, we see that the crime-related variables are located close
to each other, indicating that they are correlated with each other, and UrbanPop is less
correlated with the rest. This indicates for example that states with high murder rates tend to
have high assault and rape rates too.
Note: Rape has a loading of 0.54 on the first principal component (horizontal axis) but 0.17 on
the second (vertical axis).

There is an alternative interpretation for principal components: principal components
provide low-dimensional linear surfaces that are closest to the observations.
First principal component: the loading vector is the line in p-dimensional space that is closest
to the n observations, using average squared Euclidean distance as a measure of closeness.
So we seek a single dimension of the data that lies as close as possible to all of the data
points: this will provide a good summary of the data. If we have the first two principal
components, they span the plane (two dimensional) that is closest to the n observations. The
first three principal components span a hyperplane, etc. These are all in terms of Euclidean
distance.
Using this interpretation, the first M principal component score vectors and the
first M principal component loading vectors together provide the best M-dimensional approximation
(in terms of Euclidean distance) to the ith observation, $x_{ij} \approx \sum_{m=1}^{M} z_{im} \phi_{jm}$. In other
words, they can give a good approximation of the data when M is sufficiently large.
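
The M-dimensional approximation can be sketched as follows: the error shrinks as M grows and vanishes at M = p. The data are simulated and hypothetical, and `approximation_error` is an illustrative helper, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: p = 5 features driven mostly by 2 underlying traits.
n, p = 100, 5
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)   # eigenvalues in ascending order
Phi = eigvecs[:, ::-1]                             # loading vectors, largest variance first
Z = Xc @ Phi                                       # score vectors


def approximation_error(M):
    """Average squared Euclidean distance between x_i and its M-dimensional approximation."""
    X_hat = Z[:, :M] @ Phi[:, :M].T    # x_ij is approximated by sum over m <= M of z_im * phi_jm
    return np.mean(np.sum((Xc - X_hat) ** 2, axis=1))


errors = [approximation_error(M) for M in range(1, p + 1)]
print(errors)   # non-increasing in M, and essentially zero at M = p
```

In this sketch the error for a given M equals the sum of the variances of the discarded components, which is why a few components suffice when the remaining variances are small.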

We already mentioned that before PCA is performed, the variables should be centred to
have mean zero. Furthermore, the results obtained when we perform PCA will also depend
on whether the variables have been individually scaled: results are very sensitive to the
scaling used. It is undesirable for the principal components obtained to depend
on an arbitrary choice of scaling. Therefore, we typically scale each variable to have standard
deviation one before we perform PCA. In certain settings, for example when variables
are measured in the same units, we might not wish to scale.
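
The sensitivity to scaling can be seen in a small sketch with two simulated, hypothetical features measured on very different scales. `first_loading` is an illustrative helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: two independent features, the second on a 100 times larger scale.
n = 500
X = np.column_stack([rng.normal(scale=1.0, size=n),
                     rng.normal(scale=100.0, size=n)])


def first_loading(data):
    """First principal component loading vector of a data matrix."""
    centred = data - data.mean(axis=0)
    _, eigvecs = np.linalg.eigh(centred.T @ centred / len(centred))
    return eigvecs[:, -1]   # eigenvector of the largest eigenvalue


raw = first_loading(X)
scaled = first_loading(X / X.std(axis=0))   # each variable scaled to standard deviation one

print(np.abs(raw))      # nearly all weight on the large-scale second feature
print(np.abs(scaled))   # weights of equal magnitude on both features
```

Without scaling, the first component mostly reflects the unit of measurement of the second feature rather than any structure in the data.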


Reviews from verified buyers

Showing all 5 reviews

lotte1690 Econometrics and Operations Research · 29 reviews

3 years ago

scjansma Pre-Master Marketing · 8 reviews

4 years ago

The summary is described as "Full summary of the book The Elements of Statistical Learning" however not all chapters are summarized. This summary covers chapters 1, 2, 4, 5, 6, 8 and 10. Unfortunately, I needed chapters 3,7 and 9 but these chapters are not covered in this summary.

Seller's response

4 years ago

Apologies, I checked the description and adapted it to make it more clear that it was the full summary of the course as given in 2018. Thank you for the heads up

luukvanson Bedrijfs- en Consumentenwetenschappen · 5 reviews

5 years ago

daniquesabel Fiscaal Economie · 18 reviews

6 years ago

renewijnen28 QFAS · 1 review

6 years ago

Asymptotic properties and many other derivations of different estimators are missing. Also there are some big mistakes in the summary, for example on how bootstrapping works.

Seller's response

6 years ago

Hello! About the bootstrap, I got it directly from the book used 1 year ago, no input from me. For things missing, this could be the case - I only summarized what was needed for the course back then and the scope might have changed. So sorry if this was not what you expected, and for other people; the summary is based on the course given in 2018.
