Steven S. Skiena
The Data Science Design Manual
,Steven S. Skiena
Computer Science Department
Stony Brook University
Stony Brook, NY
USA
ISSN 1868-0941 ISSN 1868-095X (electronic)
Texts in Computer Science
ISBN 978-3-319-55443-3 ISBN 978-3-319-55444-0 (eBook)
DOI 10.1007/978-3-319-55444-0
Library of Congress Control Number: 2017943201
© The Author(s) 2017
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
, Preface
Making sense of the world around us requires obtaining and analyzing data from
our environment. Several technology trends have recently collided, providing
new opportunities to apply our data analysis savvy to greater challenges than
ever before.
Computer storage capacity has increased exponentially; indeed remembering
has become so cheap that it is almost impossible to get computer systems to for-
get. Sensing devices increasingly monitor everything that can be observed: video
streams, social media interactions, and the position of anything that moves.
Cloud computing enables us to harness the power of massive numbers of ma-
chines to manipulate this data. Indeed, hundreds of computers are summoned
each time you do a Google search, scrutinizing all of your previous activity just
to decide which is the best ad to show you next.
The result of all this has been the birth of data science, a new field devoted
to maximizing value from vast collections of information. As a discipline, data
science sits somewhere at the intersection of statistics, computer science, and
machine learning, but it is building a distinct heft and character of its own.
This book serves as an introduction to data science, focusing on the skills and
principles needed to build systems for collecting, analyzing, and interpreting
data.
My professional experience as a researcher and instructor convinces me that
one major challenge of data science is that it is considerably more subtle than it
looks. Any student who has ever computed their grade point average (GPA) can
be said to have done rudimentary statistics, just as drawing a simple scatter plot
lets you add experience in data visualization to your resume. But meaningfully
analyzing and interpreting data requires both technical expertise and wisdom.
That so many people do these basics so badly provides my inspiration for writing
this book.
To the Reader
I have been gratified by the warm reception that my book The Algorithm Design
Manual [Ski08] has received since its initial publication in 1997. It has been
recognized as a unique guide to using algorithmic techniques to solve problems
that often arise in practice. The book you are holding covers very different
material, but with the same motivation.
, In particular, here I stress the following basic principles as fundamental to
becoming a good data scientist:
• Valuing doing the simple things right: Data science isn’t rocket science.
Students and practitioners often get lost in technological space, pursuing
the most advanced machine learning methods, the newest open source
software libraries, or the glitziest visualization techniques. However, the
heart of data science lies in doing the simple things right: understanding
the application domain, cleaning and integrating relevant data sources,
and presenting your results clearly to others.
Simple doesn’t mean easy, however. Indeed it takes considerable insight
and experience to ask the right questions, and sense whether you are mov-
ing toward correct answers and actionable insights. I resist the temptation
to drill deeply into clean, technical material here just because it is teach-
able. There are plenty of other books which will cover the intricacies of
machine learning algorithms or statistical hypothesis testing. My mission
here is to lay the groundwork of what really matters in analyzing data.
• Developing mathematical intuition: Data science rests on a foundation of
mathematics, particularly statistics and linear algebra. It is important to
understand this material on an intuitive level: why these concepts were
developed, how they are useful, and when they work best. I illustrate
operations in linear algebra by presenting pictures of what happens to
matrices when you manipulate them, and statistical concepts by exam-
ples and reducto ad absurdum arguments. My goal here is transplanting
intuition into the reader.
But I strive to minimize the amount of formal mathematics used in pre-
senting this material. Indeed, I will present exactly one formal proof in
this book, an incorrect proof where the associated theorem is obviously
false. The moral here is not that mathematical rigor doesn’t matter, be-
cause of course it does, but that genuine rigor is impossible until after
there is comprehension.
• Think like a computer scientist, but act like a statistician: Data science
provides an umbrella linking computer scientists, statisticians, and domain
specialists. But each community has its own distinct styles of thinking and
action, which gets stamped into the souls of its members.
In this book, I emphasize approaches which come most naturally to com-
puter scientists, particularly the algorithmic manipulation of data, the use
of machine learning, and the mastery of scale. But I also seek to transmit
the core values of statistical reasoning: the need to understand the appli-
cation domain, proper appreciation of the small, the quest for significance,
and a hunger for exploration.
No discipline has a monopoly on the truth. The best data scientists incor-
porate tools from multiple areas, and this book strives to be a relatively
neutral ground where rival philosophies can come to reason together.