Summary Reading Material


For the course Introduction to Data Science, you get a lot of extra reading material (articles, papers, etc.). It helped me quite a bit to summarise (or at least make an overview of) this material. The test includes a considerable number of questions about it, so it's nice for you to read...


  • 18 December 2019
  • 34 pages
  • 2019/2020
  • Summary
berendmarkhorst
Introduction to Data Science

2019

LECTURE 1 - A TAXONOMY OF DATA SCIENCE

http://www.dataists.com/2010/09/a-taxonomy-of-data-science/

A useful taxonomy for data science would be OSEMN: Obtain, Scrub, Explore, Model and iNterpret. Ideally, a
data scientist should be at home with them all.

OBTAIN

Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from
multiple sources. At the very least, one should know how to do this in a UN*X environment or in Python. One
should also be familiar with APIs (application programming interfaces).
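
As a minimal sketch of combining data from multiple sources in Python, the example below parses a JSON payload (shaped like something a web API might return) and a CSV export, then joins them on a shared key. The payload, the CSV content, the field names and the helper `combine_sources` are all invented for illustration; no real API is called.

```python
import csv
import io
import json

# Hypothetical JSON payload, shaped like a response a web API might return.
api_payload = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'

# Hypothetical CSV export from a second source, keyed on the same id.
csv_export = "id,city\n1,Amsterdam\n2,Utrecht\n"

def combine_sources(json_text, csv_text):
    """Join JSON records and CSV rows on their shared 'id' field."""
    records = {rec["id"]: dict(rec) for rec in json.loads(json_text)}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = int(row["id"])  # CSV values arrive as strings
        if key in records:
            records[key]["city"] = row["city"]
    return list(records.values())

corpus = combine_sources(api_payload, csv_export)
print(corpus)
```

In practice the JSON text would come from an HTTP request and the CSV from a file on disk; the joining logic stays the same.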

SCRUB

There will almost always be some amount of data cleaning (or scrubbing) necessary before analysis of the
data is possible. It is the least sexy part of the analysis process, but it is often the part that yields the greatest benefits. A
simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.
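
To make the scrubbing step concrete, here is a small sketch that trims whitespace, normalises capitalisation, drops unusable rows and removes duplicates. The rows and the helper `scrub` are hypothetical; real cleaning rules depend on the data set at hand.

```python
def scrub(rows):
    """Clean raw (name, age) string pairs: trim, normalise, drop bad rows, dedupe."""
    seen = set()
    clean = []
    for name, age in rows:
        name = name.strip().title()
        age = age.strip()
        if not name or not age.isdigit():
            continue  # skip rows with a missing name or a non-numeric age
        record = (name, int(age))
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean

# Hypothetical messy input, as it might arrive from a spreadsheet export.
raw_rows = [
    ("  alice ", "34"),
    ("BOB", " 29"),
    ("alice", "34"),    # duplicate after normalisation
    ("carol", "n/a"),   # unusable age
    ("", "41"),         # missing name
]
cleaned = scrub(raw_rows)
print(cleaned)
```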

EXPLORE

Visualizing (e.g. histograms and scatter plots), clustering, performing dimensionality reduction (e.g. PCA): these
are all part of ‘looking at data’. No hypothesis is being tested and no predictions are attempted. They are quite
useful for getting to know your data.
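
A first look at the data can be as simple as a few summary statistics and a histogram. The sketch below uses only the Python standard library and a made-up list of measurements; `text_histogram` is a hypothetical helper, and a plotting library would normally be used instead of text bars.

```python
from collections import Counter
import statistics

values = [2, 3, 3, 4, 4, 4, 5, 5, 9]  # hypothetical measurements

def text_histogram(data):
    """Return one line per distinct value, with one '#' per occurrence."""
    counts = Counter(data)
    return [f"{value:>2} | {'#' * counts[value]}" for value in sorted(counts)]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "max": max(values),
}
print(summary)
print("\n".join(text_histogram(values)))
```

Even this crude view shows that most values cluster between 2 and 5 while 9 sits apart, which is the kind of observation exploration is meant to surface before any model is built.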

MODEL

Often, the ‘best’ model is the most predictive model. One can leave out a fraction of the data (the validation or
test set), learn/optimize a model using the remaining data (the learning or training set) by minimizing a chosen
loss function, and evaluate this or another loss function on the held-out data. Repeating this over different
splits is known as cross-validation. Models are built to predict and to interpret. The former can be assessed
quantitatively, the latter cannot.
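
The hold-out idea described above can be sketched as k-fold cross-validation around a simple least-squares line. This is an illustration under assumptions, not a method taken from the article: the data are invented, squared error is used as the loss for both fitting and evaluation, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (minimises squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and observed values."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, k):
    """Average validation loss over k folds; each point is held out exactly once."""
    losses = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        valid = [i for i in range(len(xs)) if i % k == fold]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        losses.append(
            mean_squared_error(model, [xs[i] for i in valid], [ys[i] for i in valid])
        )
    return sum(losses) / k

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2]  # hypothetical, roughly y = 1 + 2x
cv_loss = cross_validate(xs, ys, k=4)
print(cv_loss)
```

A lower cross-validated loss indicates a model that predicts unseen data better, which is the quantitative assessment the text refers to.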

INTERPRET

The predictive power of a model lies in its ability to generalize in the quantitative sense: to make accurate
quantitative predictions of data in new experiments. The interpretability of a model lies in its ability to
generalize in the qualitative sense: to suggest to the modeler which would be the most interesting experiments
to perform next.

CONCLUSION

Data science is clearly a blend of the hackers’ arts (primarily in steps “O” and “S” above); statistics and machine
learning (primarily steps “E” and “M” above); and the expertise in mathematics and the domain of the data for
the analysis to be interpretable.





LECTURE 1 - THE DATA SCIENCE VENN DIAGRAM

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as
such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and
where data science fits. It is clear, however, that one needs to learn a lot as they aspire to become a fully
competent data scientist.




HOW TO READ THE DATA SCIENCE VENN DIAGRAM

• Data science is interdisciplinary. Hacking skills, math & stats knowledge and substantive
expertise are each very valuable on their own, but any one of them combined with only one other is at best
simply not data science, and at worst downright dangerous.
• Hacking skills: Data is a commodity traded electronically. Hence, it is handy to “speak hacker”. Being
able to manipulate text files at the command line, understanding vectorized operations and thinking
algorithmically are the hacking skills that make for a successful data hacker.
• Math & Statistics Knowledge: Having acquired and cleaned the data, one should look for insights.
For this, you need to apply appropriate math and statistical methods.
• Substantive Expertise: Science is about discovery and building knowledge, which requires some
motivating questions about the world and hypotheses that can be brought to data and tested with
statistical methods.
• Danger zone (hacking skills plus substantive expertise, without math & statistics knowledge): people
who can run a linear regression, but do not know what the coefficients mean.
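
To illustrate what "knowing what the coefficients mean" involves, the sketch below fits a line to invented data on study hours and exam scores and then reads off the two coefficients. The numbers and the helper `fit_line` are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

hours = [1, 2, 3, 4, 5]        # hypothetical hours studied
scores = [52, 55, 61, 64, 68]  # hypothetical exam scores

intercept, slope = fit_line(hours, scores)

# Slope: expected change in score for one extra hour of study.
# Intercept: predicted score at zero hours (an extrapolation outside the data).
print(f"slope = {slope:.2f} points per hour, intercept = {intercept:.2f} points")

# The slope depends on the unit of x: measured in minutes it is 60 times smaller.
minutes = [h * 60 for h in hours]
slope_per_minute = fit_line(minutes, scores)[1]
print(f"slope = {slope_per_minute:.4f} points per minute")
```

The fit describes an association in this sample; it does not by itself show that studying longer causes higher scores, which is the kind of judgement the math & statistics circle of the diagram supplies.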





LECTURE 2 - WHAT IS THE CRISP-DM METHODOLOGY?

https://www.sv-europe.com/crisp-dm-methodology/#dataunderstanding

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. This methodology provides a structured
approach to planning a data mining project. The model is an idealised sequence of events. In practice, many of
the tasks can be performed in a different order, and it will often be necessary to backtrack to previous tasks
and repeat certain actions.




STAGE 1: DETERMINE BUSINESS OBJECTIVES


WHAT ARE THE DESIRED OUTPUTS OF THE PROJECT?
1. Set objectives. This means describing your primary objective from a business perspective.
2. Produce project plan. The plan should specify the steps to be performed during the rest of the project,
including the initial selection of tools and techniques.
3. Business success criteria. Here you’ll lay out the criteria that you’ll use to determine whether the project
has been successful from the business point of view. → Specific & measurable.


ASSESS THE CURRENT SITUATION
1. Inventory of resources → personnel, data, computing resources and software.
2. Requirements, assumptions and constraints → e.g. the GDPR and constraints on the availability of
resources.
3. Risks and contingencies → risks that might delay the project.
4. Terminology → compile a glossary of terminology relevant to the project.
5. Costs and benefits → financial measures in a commercial situation.


DETERMINE DATA MINING GOALS
1. Business success criteria → state the objectives in business terminology. Describe the intended outputs of
the project that enable the achievement of the business objectives.
2. Data mining success criteria → state the project objectives in technical terms, for example: a certain level of
predictive accuracy.
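
A data mining success criterion such as "a certain level of predictive accuracy" can be written down as an explicit check. The sketch below is only an illustration: the 0.80 threshold, the labels and the function names are assumptions, not values from the CRISP-DM material.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)

def meets_criterion(predictions, actuals, threshold):
    """True when predictive accuracy reaches the agreed threshold."""
    return accuracy(predictions, actuals) >= threshold

REQUIRED_ACCURACY = 0.80  # assumed target, agreed with the business up front

# Hypothetical model output and observed outcomes.
predicted = ["churn", "stay", "stay", "churn", "stay"]
observed = ["churn", "stay", "churn", "churn", "stay"]

print(accuracy(predicted, observed))
print(meets_criterion(predicted, observed, REQUIRED_ACCURACY))
```

Fixing the threshold before modelling starts keeps the criterion specific and measurable, in line with the business success criteria described earlier.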


PRODUCE PROJECT PLAN

