Lecture notes Supervised learning
Course: Introduction to Data Science 2021/2022
Study: Bachelor Psychology Tilburg University
Lecture 1
The meaning of Data Science
Data science is a buzzword: everyone in the world is looking for a data scientist. It's a
hot topic. Data science is not the science of data; we're simply using data to do
science (which is what the term really means). It consists
of a mixture of skills and a merger of disciplines:
statistics, computer science, mathematics,
programming, data processing, data visualization,
communication, and substantive expertise. The last
two are soft skills that are quite unique to data
science. You're not just a consultant who brings
nothing but statistical/analytical skills; you also need
to understand your product and be able to tell
something about it. The picture on the right is a
pyramid that shows us the stages of data science. The bottom two layers describe
the job of data engineers. They make data available and analysable. The middle
layers describe the traditional data scientist. At the highest layer you are busy with
computer science and deep learning. The more of the data science disciplines you
try to learn, the less you actually know about any single discipline. That's why there are
data science teams that are specialized in a discipline (or a few disciplines). Because
data science is a merger of disciplines, there are a lot of terms that actually refer to
the same thing (confusing nomenclature).
The Data Science mindset
What is Data Science? First of all, it’s a focus on practical problem solving.
- Data science should create value.
- We're trying to solve real-world problems and extract knowledge from data.
- Start with a question and use data to answer it. (Is it going to rain tomorrow?)
- Don’t start with data and generate answerable questions.
- Use appropriately complex methods.
- Don’t waste resources on complex analyses when simpler analyses will solve
your problem equally well.
- Don’t settle for bad answers just because good answers will require
complex/difficult analyses.
- Don’t ask if you can; ask if you should.
- Why are you doing a particular analysis?
- All analytic decisions should be justified.
Secondly, it has a strong focus on pragmatism and scepticism.
- Don't be tied to a "pet method". Analyse your data with the right method. For
example, being a Bayesian doesn't mean the best thing is to apply Bayesian
statistics to all your data.
- Embrace exploratory methods.
- Don’t overgeneralize exploratory findings.
- Treat neither data nor theory as sacred.
- Don’t sanctify theory in the face of (definitively) contradictory data.
- Don’t blithely let data overrule well-supported theory.
- Trust no one.
- Not data, other people, or yourself.
- Check and double check.
- Don’t assume what can be tested.
- When in doubt, take the conservative approach.
- Document everything! (syntax)
Lastly, it’s a fast-paced, curious, open-minded attitude.
- Iterate quickly, fail quickly: if you're trying something new, make sure you fail fast,
because if it's not going to work, you want to know sooner rather than later.
- Never stop learning.
- Learn and use new methods.
- Always remain open to new ideas/approaches.
- Don’t be afraid to tackle new problems.
- Generalize and extend what you know.
- Don’t stagnate.
- Show an appropriate degree of humility.
- You don’t know everything. Embrace and correct your ignorance.
- Ask questions. Communicate and don’t just talk.
Data Science workflow
This is a representation of the Research
cycle used for empirical research in
most of the sciences:
And this is the Data Science Cycle (O'Neil & Schutt, 2014). The grey circles must
occur in every data analysis! A data product is something you deploy to the world,
like a spam filter, for example. EDA finds problems in your data that you then need to
fix in the earlier steps of the cycle. If you already have clean data, EDA is not
necessary and you can go straight to modeling, or to reporting findings (if you just
want to show the data and don't do any analysis).
Data Science novelties
In the social and behavioural sciences, we are accustomed to analysing small,
rectangular datasets. The rows represent observational units, like individuals, and the
columns represent the variables. Data science deals with much more diverse data:
- Relational databases: like an Access database or a SQL database. The idea is
that you have a bunch of different rectangular datasets, but they're not joined together;
for example, a dataset for men and a dataset for women (see the sketch after this list).
- Data streams: these are becoming more common in psychology. Data come in
continuously; the dataset is not fixed, and new information arrives in real time.
- Web logs: data that are logged the moment they happen, like browsing history,
purchasing history, etcetera.
- Sensor data
- Image data: usually kind of rectangular. An image can be broken into pixels, and
each pixel has a certain saturation. The saturation is a number, so a picture
can be broken into a set of numbers (rectangular).
- Unstructured text: text that you have on a page. Computational linguists analyse
unstructured text.
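To make the relational-database idea concrete, here is a small sketch (my own illustration, not from the lecture): a few made-up pandas tables that are each rectangular on their own and only become one dataset once you join them on a shared key, which is essentially what a SQL JOIN does.

```python
import pandas as pd

# Two separate rectangular tables, as in the lecture's example of a
# dataset for men and a dataset for women (made-up numbers).
men = pd.DataFrame({"id": [1, 2], "age": [21, 24]})
women = pd.DataFrame({"id": [3, 4], "age": [22, 20]})

# A third table that shares the "id" key with the other two.
grades = pd.DataFrame({"id": [1, 2, 3, 4], "grade": [7.5, 6.0, 8.0, 9.0]})

# Stack the two same-shaped tables into one rectangular dataset ...
people = pd.concat([men, women], ignore_index=True)

# ... and join it with the grades table on the shared key,
# which is what a relational database does with a JOIN.
combined = people.merge(grades, on="id")
print(combined)
```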
These datasets are often much larger and less structured than those traditionally
analysed in the social and behavioural sciences. When dealing with large amounts of
(distributed) data, we should move the data as little as possible. Data is often
distributed: stored on different computers and different servers. We can
analyse distributed data in situ (where they are stored) without moving them to a central
computer. You run the analysis at every location and move only the partial results,
instead of the data (distributed computing); a toy sketch of this idea follows the
technology list below. When executing long-running jobs, we should try
to split the calculations into smaller pieces that can be executed simultaneously, also
known as parallel processing. This way of processing has two options:
embarrassingly parallel (splitting the job into parallel tasks takes little or no extra
work) and multi-threading (a more advanced form of parallel processing). We can
distribute embarrassingly parallel jobs directly; there's no real need for clever task
partitioning. Multi-threaded jobs need to be broken into independent subtasks.
Independent means that the execution of one subtask doesn't have to wait for the
results of another (they can run simultaneously). There are several technologies that
can facilitate multi-threading and help you with splitting up the data:
- Small scale (not very big distributed data): Message Passing Interface (MPI) and
Open Multi Processing (OpenMP).
- Large Scale (big data): Google’s MapReduce algorithm, Apache Hadoop and
Apache Spark.
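As a toy sketch of the in-situ idea (my own illustration, not from the lecture, and using Python's multiprocessing module rather than the technologies listed above): each hypothetical "node" holds its own chunk of the data and computes a small partial result (its sum and count); only those partial results are moved and combined into the global mean. Because the nodes don't need each other's results, the per-node step is embarrassingly parallel.

```python
import multiprocessing as mp
import numpy as np

def partial_stats(node_data):
    """Compute one node's partial result: (sum, count) of its local data."""
    x = np.asarray(node_data, dtype=float)
    return x.sum(), x.size

if __name__ == "__main__":
    # Three hypothetical 'nodes', each holding its own chunk of the data.
    nodes = [
        [2.0, 4.0, 6.0],
        [1.0, 3.0],
        [5.0, 7.0, 9.0, 11.0],
    ]

    # Each node's partial result can be computed independently,
    # so this step is embarrassingly parallel.
    with mp.Pool() as pool:
        partials = pool.map(partial_stats, nodes)

    # Only the small (sum, count) pairs are moved, not the raw data.
    total = sum(s for s, _ in partials)
    n = sum(c for _, c in partials)
    print("global mean:", total / n)  # 48.0 / 9 = 5.33...
```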
Example: embarrassingly parallel processing
Run a Monte Carlo simulation to test the central limit theorem: this basically says
that the sum/mean of a sequence of independent, random variables approaches a
normal distribution as the number of variables in the sequence grows. So if you
average over a large enough number of individuals, for example, you can assume
that the distribution of that mean is approximately normal.
- Population model: x_p ∼ Γ(0.5, 1.0) (a gamma distribution; as you can see in the
picture on the right, it's not normally distributed or even close to that)
- Parameters: P ∈ {1, 2, ..., 50} and N ∈ {5, 10, ..., 100}
- Mean score for the nth row: x̄_n = P⁻¹ Σ_{p=1}^{P} x_np
- Outcome: KS statistic testing if x̄_n is normally distributed
The first thing we need to do is to define a function to run one replication of the
simulation:
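The notes stop before the code itself, so what follows is only a minimal sketch of what such a replication function could look like. It assumes NumPy/SciPy, a function name run_rep, and that the row means are standardized before being compared to a standard normal with a Kolmogorov-Smirnov test; none of these choices are necessarily the lecturer's implementation.

```python
import numpy as np
from scipy import stats

def run_rep(n, p, rng=None):
    """One replication of the CLT simulation for a given N and P.

    Draws an n-by-p matrix from the Gamma(0.5, 1.0) population, averages
    each row over its p variables, and returns the KS statistic comparing
    the standardized row means to a standard normal distribution.
    """
    rng = np.random.default_rng() if rng is None else rng

    # N observations on P independent Gamma(0.5, 1.0) variables.
    x = rng.gamma(shape=0.5, scale=1.0, size=(n, p))

    # Mean score for each of the N rows.
    means = x.mean(axis=1)

    # Standardize the means so they can be tested against a standard normal.
    z = (means - means.mean()) / means.std(ddof=1)

    return stats.kstest(z, "norm").statistic
```

Because every (N, P) replication is independent of all the others, the full simulation is embarrassingly parallel: the replications could, for example, be farmed out with multiprocessing.Pool in the same way as the earlier sketch.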