Article 1: 50 years of Data Science
Q1: How does data science differ from and overlap with ‘traditional’ statistics?
Data Science and ‘traditional’ statistics overlap in that both work with large datasets (i.e., ‘big data’). However, we can immediately reject ‘big data’ as a criterion for a meaningful distinction between statistics and data science.
• “Data Scientist" means a professional who uses scientific methods to liberate and
create meaning from raw data.
• “Statistics" means the practice or science of collecting and analyzing numerical data
in large quantitie.
Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman urged academic
statistics to expand its boundaries beyond the classical domain of theoretical statistics;
Chambers called for more emphasis on data preparation and presentation rather than
statistical modeling; and Breiman called for emphasis on prediction rather than inference.
Cleveland even suggested the catchy name “Data Science” for his envisioned field.
The statistics profession faces a choice in its future research between continuing
concentration on traditional topics – based largely on data analysis supported by
mathematical statistics – and a broader viewpoint – based on an inclusive concept of
learning from data. In short, academic statisticians were exhorted repeatedly across the
years to change paths, towards a much broader definition of their field.
The statistical community has been committed to the almost exclusive use of [generative]
models. This commitment has led to irrelevant theory, questionable conclusions, and has
kept statisticians from working on a large range of interesting current problems. [Predictive]
modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can
be used both on large complex data sets and as a more accurate and informative alternative
to data modeling on smaller data sets. If our goal as a field is to use data to solve problems,
then we need to move away from exclusive dependence on [generative] models ...
There are two goals in analyzing the data:
• Prediction. To be able to predict what the responses are going to be to future input
variables;
• [Inference]. To [infer] how nature is associating the response variables to the input
variables.
Breiman says that users of data split into two cultures, based on their primary allegiance to
one or the other of these goals.
• The ‘Generative Modeling’ culture seeks to develop stochastic models which fit the
data, and then make inferences about the data-generating mechanism based on the
structure of those models. According to Breiman, this is what statistics is.
• The ‘Predictive Modeling’ culture prioritizes prediction; Breiman estimated that it
encompasses only 2% of academic statisticians – himself included – along with many
computer scientists and important industrial statisticians. This culture is, roughly,
what Cleveland proposed to call Data Science.
Q2: What are the main activities of (greater) Data Science?
The larger vision posits a professional on a quest to extract information from data. The larger
field cares about each and every step that the professional must take, from getting
acquainted with the data all the way to delivering results based upon it, and extending even
to that professional’s continual review of the evidence about best practices of the whole field
itself. Donoho calls the narrower field ‘Lesser Data Science’ (LDS) and the larger would-be
field ‘Greater Data Science’ (GDS).
The Six Divisions
The activities of Greater Data Science are classified into 6 divisions:
• Data Exploration and Preparation
• Data Representation and Transformation
• Computing with Data
• Data Visualization and Presentation
• Data Modeling
• Science about Data Science
GDS1: Data Exploration and Preparation. Some say that 80% of the effort devoted to data
science is expended by diving into or becoming one with one’s messy data to learn the
basics of what’s in them, so that data can be made ready for further exploitation.
We identify two subactivities, with a minimal code sketch after this list:
• Exploration: every data scientist devotes serious time and effort to exploring data to
sanity-check its most basic properties, and to expose unexpected features.
• Preparation: one speaks colorfully of data cleaning.
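To make these two subactivities concrete, here is a hedged sketch using pandas; the file name, column names, and plausibility checks are hypothetical placeholders, not anything prescribed by the article.

```python
# A minimal sketch of exploration and preparation with pandas.
# The file "measurements.csv" and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("measurements.csv")

# Exploration: sanity-check basic properties and expose unexpected features.
print(df.shape)            # how many rows and columns?
print(df.dtypes)           # are the types what we expect?
print(df.describe())       # ranges, means, obvious outliers
print(df.isna().sum())     # how much is missing, and where?

# Preparation ("data cleaning"): fix the problems the exploration revealed.
df = df.drop_duplicates()                           # remove exact duplicate rows
df["temperature"] = pd.to_numeric(df["temperature"], errors="coerce")
df = df.dropna(subset=["temperature"])              # drop rows we cannot repair
df = df[df["temperature"].between(-50, 60)]         # discard implausible values (assumed range)
```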
GDS2: Data Representation and Transformation. A data scientist works with many
different data sources during a career. Data scientists very often find that a central step in
their work is to implement an appropriate transformation restructuring the originally given
data into a new and more revealing form.
Data Scientists develop skills in two specific areas:
• Modern Databases: Data scientists need to know the structures, transformations,
and algorithms involved in using all these different representations.
• Mathematical Representations. These are interesting and useful mathematical
structures for representing data of special types, including acoustic, image, sensor,
and network data. For example, to get features from acoustic data, one often
transforms to the cepstrum or the Fourier transform; for image and sensor data, one
uses the wavelet transform or some other multiscale transform (e.g., pyramids in
deep learning). A small example follows this list.
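As a small illustration of such a representation change, the sketch below transforms a simulated acoustic signal to the frequency domain with NumPy’s FFT and forms a simple real cepstrum; the sampling rate and test tone are made up for the example.

```python
# Representation change: a (synthetic) acoustic signal moved to the frequency
# domain, as one step toward features such as the cepstrum.
import numpy as np

fs = 8000                                    # sampling rate in Hz (assumed)
t = np.arange(0, 1.0, 1.0 / fs)              # one second of samples
signal = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(t.size)

spectrum = np.fft.rfft(signal)               # frequency-domain representation
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
magnitude = np.abs(spectrum)

# A simple real cepstrum: inverse transform of the log magnitude spectrum.
cepstrum = np.fft.irfft(np.log(magnitude + 1e-12))

print(freqs[np.argmax(magnitude)])           # should be close to 440 Hz
```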
GDS3: Computing with Data. Every data scientist should know and use several languages
for data analysis and data processing.
• Data scientists develop workflows that organize work to be split across many jobs,
run sequentially or across many machines (a minimal sketch follows below).
• Data scientists also develop workflows that document the steps of an individual data
analysis or research project.
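A hedged sketch of the first kind of workflow, using only Python’s standard library to run independent jobs in parallel; the analysis function and parameter grid are hypothetical stand-ins, and a real project might use a cluster scheduler or workflow engine instead.

```python
# Splitting work across jobs with the standard library.
from concurrent.futures import ProcessPoolExecutor

def run_analysis(parameter):
    """One independent job: analyze the data under one parameter setting (placeholder)."""
    return parameter, parameter ** 2

if __name__ == "__main__":
    parameters = [0.1, 0.5, 1.0, 2.0, 5.0]   # hypothetical parameter grid
    # Run the jobs in parallel across local processes instead of sequentially.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_analysis, parameters))
    for parameter, value in results:
        print(f"parameter={parameter}: result={value}")
```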
GDS4: Data Visualization and Presentation. Data visualization at one extreme overlaps
with the very simple plots of EDA - histograms, scatterplots, time series plots. Data scientists
also create dashboards for monitoring data processing pipelines that access streaming or
widely distributed data. Finally, they develop visualizations to present conclusions from a
modeling exercise or a CTF (Common Task Framework) challenge.
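For instance, the simple EDA plots mentioned above might look like the following matplotlib sketch on simulated data; a real project would of course plot its actual variables.

```python
# Simple EDA plots: a histogram and a scatterplot on simulated data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=30)                 # distribution of one variable
ax1.set_title("Histogram of x")
ax2.scatter(x, y, s=5)               # relationship between two variables
ax2.set_title("y versus x")
fig.tight_layout()
plt.show()
```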
GDS5: Data Modeling. Each data scientist in practice uses tools and viewpoints from both
of Leo Breiman’s modeling cultures (contrasted in the sketch after this list):
• Generative modeling, in which one proposes a stochastic model that could have
generated the data, and derives methods to infer properties of the underlying
generative mechanism. This, roughly speaking, coincides with traditional academic
statistics and its offshoots.
• Predictive modeling, in which one constructs methods which predict well over some
given data universe – i.e., some very specific concrete dataset. This roughly
coincides with modern Machine Learning and its industrial offshoots.
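The following sketch contrasts the two cultures on the same simulated dataset; the library choices (statsmodels for inference, scikit-learn for prediction) are illustrative assumptions, not something the article mandates.

```python
# Generative modeling (inference about a posited mechanism) versus
# predictive modeling (judged only by held-out accuracy), on simulated data.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=500)

# Generative modeling: posit y = X @ beta + noise, then infer beta and its uncertainty.
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.params)          # estimated coefficients of the assumed mechanism
print(ols.conf_int())      # confidence intervals: inference about "nature"

# Predictive modeling: no mechanism assumed; success is out-of-sample prediction.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print(forest.score(X_test, y_test))   # R^2 on held-out data
```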
GDS6: Science about Data Science. Tukey advocated the study of what data analysts ‘in
the wild’ are actually doing, and reminded us that the true effectiveness of a tool is related to
the probability of deployment times the probability of effective results once deployed.
The scope here also includes foundational work to make future such science possible –
such as encoding documentation of individual analyses and conclusions in a standard digital
format for future harvesting and meta-analysis.
In particular, meta-analysts have learned that a dismaying fraction of the conclusions in the
scientific literature are simply incorrect (i.e., far more than 5%), that most published
effect sizes are overstated, that many results are not reproducible, and so on.
Our government spends tens of billions of dollars every year to produce more than 1 million
scientific articles. It approaches cosmic importance to learn whether science as actually
practiced is succeeding, and how science as a whole can improve.
Q3: How does (open) data science generate new scientific opportunities?
In principle, the purpose of scientific publication is to enable reproducibility of research
findings.
To meet the original goal of scientific publication, one should share the underlying code and
data. Moreover, there are benefits to authors. Working from the beginning with a plan for
sharing code and data leads to higher-quality work, and ensures that authors can access
their own former work, and that of their co-authors, students, and postdocs.
To work reproducibly in today’s computational environment, one constructs automated
workflows which generate all the computations and all the analyses in a project. As a
corollary, one can then easily and naturally refine and improve earlier work continuously.
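A minimal sketch of such an automated workflow: one script that regenerates every derived artifact of a (hypothetical) project from its inputs, with a fixed random seed so that reruns are identical; the file names and the analysis itself are placeholders.

```python
# One entry point that regenerates all computations and outputs of a project.
from pathlib import Path
import numpy as np

OUT = Path("results")
OUT.mkdir(exist_ok=True)

def generate_data(seed=0):
    """Stand-in for loading or simulating the project's raw data."""
    rng = np.random.default_rng(seed)      # fixed seed: identical reruns
    return rng.normal(size=1000)

def analyze(data):
    """Stand-in for the project's analyses."""
    return {"mean": float(np.mean(data)), "std": float(np.std(data))}

def main():
    data = generate_data()
    summary = analyze(data)
    # Every output is written by the workflow itself, never edited by hand.
    (OUT / "summary.txt").write_text(str(summary))

if __name__ == "__main__":
    main()
```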
Reproducibility of computational experiments is just as important to industrial data science
as it is to scientific publication. It enables a disciplined approach to proposing and evaluating
potential system improvements and an easy transition of validated improvements into
production use.