Data Science Methods

Week 1: Chapter 1 and Chapter 10
Supervised learning → involves building a statistical model for predicting or estimating an output based on one or more inputs. It is clear which outcome you want to predict or which treatment effect you want to quantify.
Unsupervised learning → there are inputs but no supervising output. Nevertheless, we can learn relationships and structures from such data. You want to get a feel for the data and reduce its dimension. There is no good way to assess the resulting graphs because the goal is not well-defined; if you define the goal more clearly, you can assess the results better. Unsupervised learning should be viewed as a precursory step to supervised learning.

Clustering problem → find similarities by grouping individuals according to their observed characteristics. Here we are not trying to predict an output variable. More on this in Chapter 10.

Notation
• n = number of distinct data points or observations in our sample, like n = 3000 people.
• p = number of variables that are available for use in making predictions (like year, wage, sex etc.).
• x_ij = jth variable of the ith observation, where i = 1, …, n and j = 1, …, p.
• X = n × p matrix whose (i,j)th element is x_ij.
• T or ′ is notation for transpose.
• y_i = the ith observation of the variable on which we wish to make predictions, like wage.




Note that we can only multiply two matrices A and B to get AB if the number of columns in A is equal to the number of rows in B.
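As a quick check of this dimension rule, here is a minimal numpy sketch (the matrices are made up for illustration):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # 2 x 3: two rows, three columns
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])      # 3 x 2: three rows, two columns

print(A @ B)                # defined: A has 3 columns, B has 3 rows; result is 2 x 2
# A @ A would raise a ValueError: A has 3 columns but only 2 rows.
```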

Chapter 10: unsupervised learning
Unsupervised learning → set of statistical tools intended for the setting in which we have only a set of features X1, X2, …, Xp measured on n observations. We are not interested in prediction, because we do not have an associated response variable y. Rather, the goal is to discover interesting things about the measurements X1, X2, …, Xp. There are two specific types of unsupervised learning:
1. Principal Components Analysis → used for data visualization or data pre-processing before supervised techniques are applied.
2. Clustering → method for discovering unknown subgroups in data.

Unsupervised learning is often performed as part of an exploratory data analysis. In this case, we cannot check our work because we do not know the true answer.

Principal Component Analysis
When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability of the original set. Principal Component Analysis (PCA) refers to the process by which principal components are computed, and to the subsequent use of these components in understanding the data.

When p is large, examining two-dimensional scatter plots of the data is too cumbersome, and a better method to visualize the n observations is required. PCA does this → it finds a low-dimensional representation of a data set that contains as much of the variation as possible. This is done when we have several features and believe there are only a few underlying traits that are important for describing and analysing the data. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the p features.

How are principal components found? The first principal component of a set of features X1, X2, …, Xp is the normalized linear combination of the features

$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$

that has the largest variance. The coefficients $\phi_{11}, \ldots, \phi_{p1}$ are the loadings of the first principal component, which together make up the principal component loading vector $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^T$. Normalized means that we constrain the loadings so that their sum of squares is equal to one, $\sum_{j=1}^{p} \phi_{j1}^2 = 1$, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
The first principal component loading vector solves the optimization problem

$\max_{\phi_{11}, \ldots, \phi_{p1}} \ \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1,$

which, using $z_{i1} = \sum_{j=1}^{p} \phi_{j1} x_{ij}$, can be written as

$\max_{\phi_1} \ \frac{1}{n} \sum_{i=1}^{n} z_{i1}^2 \quad \text{subject to} \quad \|\phi_1\|^2 = 1.$

We refer to $z_{11}, \ldots, z_{n1}$ as the scores of the first principal component. The maximization problem mentioned above can be solved via an eigendecomposition.
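As an illustration of that eigendecomposition route, here is a minimal numpy sketch on made-up centred data (the data and sizes are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))         # n = 100 observations, p = 3 features
X = X - X.mean(axis=0)                # centre each variable to have mean zero

# Sample covariance matrix (p x p); its top eigenvector is the first loading vector.
S = (X.T @ X) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(S)  # eigh: ascending eigenvalues for symmetric S
phi1 = eigvecs[:, -1]                 # loading vector phi_1 with the largest eigenvalue

print(np.sum(phi1 ** 2))              # 1.0: the loadings are normalized
z1 = X @ phi1                         # scores z_11, ..., z_n1
print(z1.var(), eigvals[-1])          # the score variance equals the largest eigenvalue
```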

Geometric interpretation of the first principal component:
The loading vector $\phi_1$ with elements $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ defines a direction in feature space along which the data vary the most.
After the first principal component Z1 of the features has been determined, we can find the second principal component Z2, which is the linear combination of X1, X2, …, Xp that has maximal variance out of all linear combinations that are uncorrelated with Z1. The scores take the form

$z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \cdots + \phi_{p2} x_{ip},$

with $\phi_2$ being the second principal component loading vector.
It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction $\phi_2$ to be orthogonal (= perpendicular) to the direction $\phi_1$. To find $\phi_2$ we solve a similar maximization problem as before, but with the additional constraint that $\phi_2$ is orthogonal to $\phi_1$.
Once we have computed the principal components, we can plot them against each other in order to produce low-dimensional views of the data, e.g. Z1 against Z2, Z1 against Z3, etc.
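To make the orthogonality and uncorrelatedness claims concrete, here is a small sketch on synthetic data (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)

S = (X.T @ X) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(S)
phi1, phi2 = eigvecs[:, -1], eigvecs[:, -2]  # first two loading vectors

print(phi1 @ phi2)                 # ~0: the loading directions are orthogonal
z1, z2 = X @ phi1, X @ phi2        # score vectors Z1 and Z2
print(np.corrcoef(z1, z2)[0, 1])   # ~0: the scores are uncorrelated
```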

Principal component biplot: displays both the principal component scores and the principal component loadings.

Blue state names: represent the scores of the first two principal components.
Orange arrows: indicate the first two principal component loading vectors.

The first loading vector places approximately equal weight on Assault, Murder and Rape (the orange lines are horizontally close) and much less weight on UrbanPop. Therefore, this component roughly corresponds to a measure of the overall rate of serious crimes. The second loading vector places most of its weight on UrbanPop, and therefore roughly corresponds to the level of urbanization of the state. Overall, we see that the crime-related variables are located close to each other – indicating that they are correlated with each other – while UrbanPop is less correlated with the rest. This indicates, for example, that states with high murder rates tend to have high assault and rape rates too.
Note: Rape scores 0.54 on the first principal component (horizontal axis) but 0.17 on the second (vertical axis).
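A biplot in this style can be produced roughly as follows. This is a sketch only: the crime data set itself is not included in the summary, so the data and feature names passed in below are hypothetical, and the factor 3 stretching the arrows is an arbitrary display choice.

```python
import numpy as np
import matplotlib.pyplot as plt

def biplot(X, feature_names):
    # Scatter the first two principal component scores and overlay loading arrows.
    X = (X - X.mean(axis=0)) / X.std(axis=0)             # centre and scale each variable
    eigvals, eigvecs = np.linalg.eigh((X.T @ X) / len(X))
    phi = eigvecs[:, [-1, -2]]                           # first two loading vectors (p x 2)
    Z = X @ phi                                          # first two score vectors (n x 2)
    plt.scatter(Z[:, 0], Z[:, 1], s=10)
    for j, name in enumerate(feature_names):
        plt.arrow(0, 0, 3 * phi[j, 0], 3 * phi[j, 1],    # arrows stretched for visibility
                  color="orange", head_width=0.05)
        plt.annotate(name, (3.2 * phi[j, 0], 3.2 * phi[j, 1]))
    plt.xlabel("First principal component")
    plt.ylabel("Second principal component")
    plt.show()

# Hypothetical call with made-up data and the feature names from the example above:
rng = np.random.default_rng(4)
biplot(rng.normal(size=(50, 4)), ["Murder", "Assault", "UrbanPop", "Rape"])
```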

There is an alternative interpretation of principal components: principal components provide low-dimensional linear surfaces that are closest to the observations.
First principal component → its loading vector defines the line in p-dimensional space that is closest to the n observations, using average squared Euclidean distance as the measure of closeness. So we seek a single dimension of the data that lies as close as possible to all of the data points → this will provide a good summary of the data. Together, the first two principal components span the plane (two-dimensional) that is closest to the n observations; the first three principal components span a three-dimensional hyperplane, etc. These are all in terms of Euclidean distance.
Using this interpretation, the first M principal component score vectors and the first M principal component loading vectors together provide the best M-dimensional approximation (in terms of Euclidean distance) to the ith observation x_ij; in other words, they can give a good approximation of the data when M is sufficiently large.
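A short sketch of this approximation view on synthetic data: keeping the first M score and loading vectors gives a rank-M reconstruction of X, with the residual error governed by the discarded eigenvalues (M = 2 and the data are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh((X.T @ X) / len(X))
M = 2
phi = eigvecs[:, -M:]        # first M loading vectors (p x M)
Z = X @ phi                  # first M score vectors (n x M)
X_approx = Z @ phi.T         # x_ij ~ sum_m z_im * phi_jm: the best rank-M fit

# Mean squared residual equals the sum of the p - M smallest eigenvalues, divided by p.
print(np.mean((X - X_approx) ** 2), eigvals[:-M].sum() / X.shape[1])
```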

We already mentioned that before PCA is performed, the variables should be centred to have mean zero. Furthermore, the results obtained when we perform PCA also depend on whether the variables have been individually scaled: results are very sensitive to the scaling used, and it is undesirable for the principal components to depend on an arbitrary choice of scaling. Therefore, we typically scale each variable to have standard deviation one before performing PCA. In certain settings – for example, when the variables are measured in the same units – we might not wish to scale.
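A small sketch of why scaling matters, using made-up data in which one variable has a much larger variance than the other (the variable roles are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(scale=100, size=200),  # e.g. raw counts, large scale
                     rng.normal(scale=1, size=200)])   # e.g. a rate, small scale

X_centred = X - X.mean(axis=0)
X_scaled = X_centred / X_centred.std(axis=0)           # each variable now has sd one

for data, label in [(X_centred, "unscaled"), (X_scaled, "scaled")]:
    S = (data.T @ data) / len(data)
    phi1 = np.linalg.eigh(S)[1][:, -1]
    # Unscaled: the first loading vector is dominated by the large-variance variable.
    print(label, phi1)
```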
