For the course Introduction to Data Science, you get a lot of extra reading material (articles, papers, etc.). It has helped me quite a bit to summarise (or at least make an overview of) this material. In the test, they ask a considerable number of questions about this, so it's nice for you to read...
A useful taxonomy for data science would be OSEMN: Obtain, Scrub, Explore, Model and iNterpret. Ideally, a
data scientist should be at home with them all.
OBTAIN
Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from
multiple sources. At the least, one should know how to do this in a UN*X environment or in Python. Also, one
should be familiar with APIs (application programming interfaces).
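As a minimal sketch of obtaining data from multiple sources in Python, the example below parses a CSV export and a JSON payload (the kind of response an API might return) and joins them on a shared key. The sample data, the field names and the helper names (load_csv, load_json, join_on_id) are assumptions made for illustration; they do not come from the course material.

```python
import csv
import io
import json

# Hypothetical sample data standing in for two sources:
# a CSV export and a JSON response from some API.
CSV_TEXT = "id,city\n1,Amsterdam\n2,Utrecht\n"
JSON_TEXT = '[{"id": 1, "temp_c": 11.5}, {"id": 2, "temp_c": 9.0}]'


def load_csv(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def join_on_id(csv_rows, json_rows):
    """Combine the two sources by matching on the shared 'id' field."""
    # CSV values are strings, so the id is converted to int for matching.
    by_id = {int(row["id"]): dict(row) for row in csv_rows}
    for rec in json_rows:
        if rec["id"] in by_id:
            by_id[rec["id"]].update(temp_c=rec["temp_c"])
    return list(by_id.values())


combined = join_on_id(load_csv(CSV_TEXT), load_json(JSON_TEXT))
print(combined)
```

In a real project the two strings would be replaced by a file read and an HTTP request; the parsing and joining steps stay the same.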
SCRUB
There will almost always be some amount of data cleaning (or scrubbing) necessary before analysis of these
data is possible. It is the least sexy part of the analysis process, but it is often the part that yields the greatest benefits. A
simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.
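A small sketch of what scrubbing can look like in Python: trimming whitespace, normalising capitalisation, mapping placeholder strings to a proper missing value and dropping duplicates. The rows, the set of missing-value markers and the function names are hypothetical and only serve as an illustration.

```python
# Hypothetical raw records with typical problems: stray whitespace,
# inconsistent capitalisation, placeholder strings and a duplicate.
RAW_ROWS = [
    {"name": " Alice ", "age": "34"},
    {"name": "bob", "age": ""},
    {"name": "Alice", "age": "34"},
    {"name": "Carol", "age": "n/a"},
]

# Strings that are treated as "no value" (an assumption for this example).
MISSING = {"", "n/a", "na", "null"}


def clean_row(row):
    """Normalise one record: tidy the name and convert age to int or None."""
    name = row["name"].strip().title()
    age_text = row["age"].strip().lower()
    age = None if age_text in MISSING else int(age_text)
    return {"name": name, "age": age}


def scrub(rows):
    """Clean every row and drop exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        rec = clean_row(row)
        key = (rec["name"], rec["age"])
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)
    return cleaned


clean_rows = scrub(RAW_ROWS)
for rec in clean_rows:
    print(rec)
```

Note that the duplicate only becomes visible after normalisation: " Alice " and "Alice" are different strings in the raw data.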
EXPLORE
Visualizing (e.g. histograms and scatter plots), clustering, performing dimensionality reduction (e.g. PCA): these
are all part of ‘looking at data’. No hypothesis is being tested and no predictions are attempted. They are quite
useful for getting to know your data.
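To illustrate ‘looking at data’ without testing anything, the sketch below computes a few summary statistics and prints a text histogram using only the Python standard library. The values, the bin width and the histogram helper are made up for this example.

```python
import statistics
from collections import Counter

# Hypothetical measurements; the last value is a possible outlier.
values = [2.1, 2.4, 2.5, 3.0, 3.1, 3.3, 3.4, 4.8, 5.0, 9.7]


def histogram(data, bin_width):
    """Count how many values fall into each bin [k*w, (k+1)*w).

    Returns a dict mapping the left edge of each non-empty bin to its count.
    """
    counts = Counter(int(v // bin_width) for v in data)
    return {k * bin_width: counts[k] for k in sorted(counts)}


summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}
bins = histogram(values, bin_width=2.0)

print(summary)
for left, count in bins.items():
    # One '#' per observation in the bin.
    print(f"{left:4.1f} | {'#' * count}")
```

A gap between the bulk of the bars and an isolated bar is the kind of feature that exploration is meant to surface before any modelling starts.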
MODEL
Often, the ‘best’ model is the most predictive model. One can leave out a fraction of the data (the validation or
test set), learn/optimize a model using the remaining data (the learning or training set) by minimizing a chosen
loss function and evaluate this or another loss function on the validation data → cross validation. Models are
built to predict and to interpret. The former can be assessed quantitatively, the latter cannot.
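The procedure above can be sketched as follows, assuming a simple straight-line model and squared error as the loss function. The synthetic data, the choice of k = 5 folds and the function names are assumptions for illustration, not the method prescribed by the course.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; minimises squared error on the training set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_mse(xs, ys, k):
    """Average validation MSE over k folds; each point is held out exactly once."""
    idx = list(range(len(xs)))
    scores = []
    for fold in range(k):
        val = idx[fold::k]                      # validation (test) indices
        val_set = set(val)
        train = [i for i in idx if i not in val_set]  # training indices
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / k


# Synthetic data (an assumption): y = 1 + 2x plus Gaussian noise.
rng = random.Random(0)
xs = [i / 2 for i in range(40)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]

cv_error = k_fold_mse(xs, ys, k=5)
print(f"5-fold cross-validated MSE: {cv_error:.3f}")
```

The model is never evaluated on points it was fitted to, which is what makes the resulting error an estimate of predictive performance rather than of fit.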
INTERPRET
The predictive power of a model lies in its ability to generalize in the quantitative sense: to make accurate
quantitative predictions of data in new experiments. The interpretability of a model lies in its ability to
generalize in the qualitative sense: to suggest to the modeler which would be the most interesting experiments
to perform next.
CONCLUSION
Data science is clearly a blend of the hackers’ arts (primarily in steps “O” and “S” above); statistics and machine
learning (primarily steps “E” and “M” above); and the expertise in mathematics and the domain of the data for
the analysis to be interpretable.
The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as
such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and
where data science fits. It is clear, however, that one needs to learn a lot as they aspire to become a fully
competent data scientist.
HOW TO READ THE DATA SCIENCE VENN DIAGRAM
• Data science is interdisciplinary. Hacking skills, math & stats knowledge and substantive
expertise are on their own very valuable, but when combined with only one other are at best simply
not data science, or at worst downright dangerous.
• Hacking skills: Data is a commodity traded electronically. Hence, it is handy to “speak hacker”. Being
able to manipulate text files at the command-line, understanding vectorized operations and thinking
algorithmically are the hacking skills that make for a successful data hacker.
• Math & Statistics Knowledge: Having acquired and cleaned the data, one should look for insights.
For this, you need to apply appropriate math and statistical methods.
• Substantive Expertise: Science is about discovery and building knowledge, which requires some
motivating questions about the world and hypotheses that can be brought to data and tested with
statistical methods.
• Danger zone: people who can make a linear regression, but do not know what the coefficients mean.
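To make the ‘danger zone’ concrete, here is a small sketch of what a regression coefficient means: the slope is the expected change in the outcome per one unit of the predictor, so its value depends on the unit chosen. The study-time data and the helper name slope_intercept are hypothetical.

```python
def slope_intercept(xs, ys):
    """Least-squares fit of y = b0 + b1*x, returned as (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Hypothetical data: study time in hours and the resulting exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

b0, b1 = slope_intercept(hours, score)
print(f"Per extra hour of study the score rises by about {b1:.2f} points.")
print(f"The intercept {b0:.2f} is the predicted score at zero hours.")

# Same data with the predictor expressed in minutes: the slope shrinks
# by a factor of 60, but the fitted line (and every prediction) is unchanged.
minutes = [h * 60 for h in hours]
m0, m1 = slope_intercept(minutes, score)
print(f"Per extra minute of study the score rises by about {m1:.4f} points.")
```

Someone who reports the second slope as evidence that studying ‘barely matters’ has run the regression correctly and still drawn the wrong conclusion.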
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. This methodology provides a structured approach
to planning a data mining project. This model is an idealised sequence of events. In practice many of the tasks
can be performed in a different order and it will often be necessary to backtrack to previous tasks and repeat
certain actions.
STAGE 1: DETERMINE BUSINESS OBJECTIVES
WHAT ARE THE DESIRED OUTPUTS OF THE PROJECT?
1. Set objectives. This means describing your primary objective from a business perspective.
2. Produce project plan. The plan should specify the steps to be performed during the rest of the project,
including the initial selection of tools and techniques.
3. Business success criteria. Here you’ll lay out the criteria that you’ll use to determine whether the project
has been successful from the business point of view. → Specific & measurable.
ASSESS THE CURRENT SITUATION
1. Inventory of resources → personnel, data, computing resources and software.
2. Requirements, assumptions and constraints → e.g. the GDPR and constraints on the availability of
resources.
3. Risks and contingencies → risks that might delay the project.
4. Terminology → compile a glossary of terminology relevant to the project.
5. Costs and benefits → financial measures in a commercial situation.
DETERMINE DATA MINING GOALS
1. Data mining goals → where a business goal states objectives in business terminology, a data mining goal
states them in technical terms. Describe the intended outputs of the project that enable the achievement
of the business objectives.
2. Data mining success criteria → states project objectives in technical terms, for example: a certain level of
predictive accuracy.
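A data mining success criterion of this kind can be written down as an explicit, testable check. The sketch below assumes a classification task, a required accuracy of 80% and made-up labels and predictions; all of these are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical success criterion agreed on at the start of the project.
REQUIRED_ACCURACY = 0.80

# Hypothetical true labels and model predictions on held-out data.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

acc = accuracy(y_true, y_pred)
criterion_met = acc >= REQUIRED_ACCURACY
print(f"accuracy = {acc:.2f}, criterion met: {criterion_met}")
```

Fixing the threshold before modelling starts keeps the criterion specific and measurable, in the same way as the business success criteria.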
PRODUCE PROJECT PLAN