Summary Data Science 2019/2020
Lecture I – Introduction to Data Analytics and Supervised segmentation
Knowledge Discovery Process
The term Knowledge Discovery in Databases, or KDD for
short, refers to the broad process of finding knowledge in
data, and emphasizes the ‘high-level’ application of
particular data mining methods. It is of interest to
researchers in machine learning, pattern recognition,
databases, statistics, AI, knowledge acquisition for expert
systems and data visualization.
- Goal: to extract (identify) knowledge from data in
the context of large databases.
- It does this by using data mining methods
(algorithms) to extract (identify) what is deemed
knowledge, according to the specifications of
measures and thresholds, using a database along
with any required preprocessing, subsampling and
transformation of the database.
KDD refers to the overall process of discovering useful knowledge from data. It involves the evaluation and possibly
interpretation of the patterns to make the decision of what qualifies as knowledge. It also includes the choice of
encoding schemes, preprocessing, sampling, and projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data without the additional steps
of the KDD process.
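The distinction between the overall KDD process and the data mining step can be sketched as a tiny pipeline. This is only an illustration under assumptions: the basket data, the function names, and the choice of pair counting with a support threshold are all hypothetical and not taken from the lecture. Subsampling is skipped because the toy dataset is so small.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records (shopping baskets); None marks a corrupt record.
raw = [
    ["bread", "milk"],
    ["Bread", "butter"],
    None,
    ["milk", "bread", "butter"],
    ["beer"],
]


def preprocess(records):
    """Cleaning: drop missing records and normalise item names."""
    return [[item.lower() for item in r] for r in records if r]


def transform(baskets):
    """Transformation: represent each basket as its sorted item pairs."""
    return [list(combinations(sorted(set(b)), 2)) for b in baskets]


def mine(pair_lists):
    """Data mining step: count how often each pair occurs."""
    return Counter(pair for pairs in pair_lists for pair in pairs)


def evaluate(counts, n_baskets, min_support):
    """Evaluation: keep only patterns whose support meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in counts.items()
        if count / n_baskets >= min_support
    }


baskets = preprocess(raw)
counts = mine(transform(baskets))
knowledge = evaluate(counts, len(baskets), min_support=0.5)
print(knowledge)
```

Only `mine` corresponds to data mining in the narrow sense; the other functions stand for the additional KDD steps that decide what goes in and what qualifies as knowledge.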
Industrial symbiosis recommender
Industrial symbiosis aims to stimulate or enhance
cooperation between industrial firms to utilize industrial
waste streams from other industries and to share related
knowledge, in order to achieve sustainable production.
Recommenders can support industries through the
identification of item opportunities in waste
marketplaces, enhancing activities that may lead to the
development of an active waste exchange network.
Statistics, Machine Learning and Data mining
Statistics (top down)
- More theory based
- More model based
- More focused on testing hypotheses
Machine learning (bottom up)
- More heuristic
- Focused on improving performance of a learning agent
- Also looks at real-time learning and robotics – areas not part of data mining
Data mining and knowledge discovery
- Integrates theory and heuristics
- Focus on the entire process of knowledge discovery, including data cleaning, learning, and integration
and visualization of results
Data Mining versus..
Data warehouse/ storage
- Data warehouses coalesce data from across an enterprise, often from multiple transaction-processing
systems
Querying/ reporting (SQL, Excel, QBE, other GUI-based querying)
- Very flexible interface to ask factual questions about data
- No modeling or sophisticated pattern finding
- Most of the cool visualization
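As a small illustration of what querying/reporting means in practice, the sketch below uses Python's built-in sqlite3 module on a made-up `sales` table (the table, columns, and rows are hypothetical). The query answers a factual question, total revenue per region, without building any model.

```python
import sqlite3

# Hypothetical in-memory table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 30.0)],
)

# A factual question about the data: what is the total revenue per region?
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

revenue_per_region = dict(rows)
print(revenue_per_region)
```

The result is a summary of what is already stored; no pattern is learned and nothing is predicted for unseen data.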
OLAP – On-line Analytical processing
- OLAP provides an easy-to-use GUI to explore large data collections
- Exploration is manual; no modeling
- Dimensions of analysis preprogrammed into OLAP system.
Data warehouse vs. database (Online transaction processing)
A data warehouse supports OLAP: analytical processing to find out how your company is doing. Strategic decisions can
be made using the data warehouse, through online analytical processing. A database (OLTP) is the operational system
that handles your transactions. All transaction records are stored in that database.
Types of Machine learning
1. Supervised learning (train the algorithm on labeled examples, e.g. ‘baby’ or ‘dog’; it eventually learns from that data)
a. Classification: classify items into categories (e.g. two-way: pass or fail)
b. Regression: predict a numeric value (e.g. a percentage score)
2. Unsupervised learning (the algorithm is given only data, without labels to train on)
3. Reinforcement learning (learn from feedback in ‘a loop’; this is how many AI agents learn)
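The difference between the two supervised tasks can be sketched with a toy example. Everything here is an assumption made for illustration: the hours/score data, the least-squares line used for regression, and the 1-nearest-neighbour rule used for classification are not prescribed by the lecture.

```python
# Hypothetical training data: hours studied, exam score (%), and pass/fail label.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [30.0, 40.0, 50.0, 60.0, 70.0]
labels = ["fail", "fail", "fail", "pass", "pass"]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_label(xs, ys, x_new):
    """1-nearest-neighbour classifier: copy the label of the closest example."""
    closest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return ys[closest]


slope, intercept = fit_line(hours, scores)
predicted_score = slope * 6.0 + intercept            # regression outputs a number
predicted_label = nearest_label(hours, labels, 4.6)  # classification outputs a category
print(predicted_score, predicted_label)
```

Both models learn from labeled examples; they differ only in the type of target they predict.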
Terminology
- In a dataset, one column is always the target attribute; the remaining columns are the other attributes (features)
- Dimensionality of a dataset is the sum of the dimensions of its features: the number of
numeric features plus the number of values of each categorical feature.
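The dimensionality rule above can be checked on a small example. This is a minimal sketch under assumptions: the customer rows and the `dimensionality` helper are hypothetical, the target attribute is excluded because it is not a feature, and categorical values are counted from the values observed in the data.

```python
# Hypothetical dataset: each row is an instance, "churn" is the target attribute.
rows = [
    {"age": 34, "income": 52000, "plan": "basic", "region": "north", "churn": "no"},
    {"age": 51, "income": 61000, "plan": "premium", "region": "south", "churn": "yes"},
    {"age": 27, "income": 38000, "plan": "basic", "region": "east", "churn": "no"},
]
TARGET = "churn"


def dimensionality(rows, target):
    """Count 1 per numeric feature and 1 per distinct value of each categorical feature."""
    total = 0
    for name in rows[0]:
        if name == target:
            continue  # the target attribute is not a feature
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            total += 1  # numeric feature contributes one dimension
        else:
            total += len(set(values))  # one dimension per observed category
    return total


dim = dimensionality(rows, TARGET)
print(dim)
```

Here `age` and `income` contribute 1 each, `plan` contributes 2 and `region` contributes 3, giving a dimensionality of 7.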
Data mining tasks
- Classification (data mining tasks): learn a method for
predicting the instance class from pre-labeled (classified)
instances.
- Categorical – values or observations that can be sorted
into groups or categories. Bar charts and pie graphs are
used to graph categorical data.
o Nominal – nominal values or observations can
be assigned a code in the form of a number
where the numbers are simply labels. You can
count but not order or measure nominal data.
(eye color, dog breed, blood type)
o Ordinal – ordinal values or observations can be
ranked (put in order) or have a rating scale attached. You can count and order, but not measure
ordinal data. (satisfaction rating, mood, pain severity)
- Numerical
o Interval – differences between measurements are meaningful and constant, but there is no true
zero (temperature in degrees Celsius)
o Ratio – differences between measurements are meaningful and a true zero exists (length, salary, number
of children)
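The four measurement levels differ in which operations give a meaningful result. The sketch below illustrates this with made-up observations; the variable names and values are hypothetical.

```python
import statistics

# Hypothetical observations for each measurement level.
blood_types = ["A", "O", "O", "B", "O"]                    # nominal
pain = ["mild", "severe", "moderate", "mild", "moderate"]  # ordinal
temps_c = [10.0, 20.0]                                     # interval
lengths_cm = [10.0, 20.0]                                  # ratio

# Nominal: values can be counted, so the most frequent value (mode) is meaningful.
most_common_type = statistics.mode(blood_types)

# Ordinal: values can also be ranked, so a middle (median) category exists.
PAIN_ORDER = ["mild", "moderate", "severe"]
ranked = sorted(pain, key=PAIN_ORDER.index)
median_pain = ranked[len(ranked) // 2]

# Interval: 20 °C is 10 degrees warmer than 10 °C, but not "twice as warm".
temp_difference = temps_c[1] - temps_c[0]

# Ratio: 20 cm is both 10 cm longer than and twice as long as 10 cm.
length_ratio = lengths_cm[1] / lengths_cm[0]

print(most_common_type, median_pain, temp_difference, length_ratio)
```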
How does Data mining/ Machine Learning work?
Data mining extracts patterns from data. When there is a pattern, there is a mathematical (numeric and/ or symbolic)
relationship among data items. Types of patterns are:
- Prediction through classification and regression
- Association through link analysis and sequence analysis
- Clustering through cluster analysis
- Sequential (or time series) relationships.
Chapter 1 – Data-analytic
Throughout the first two chapters of this book, we will discuss in detail various topics and techniques related to
data science and data mining.
At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data.
Data mining is the extraction of knowledge from data, via technologies that incorporate these principles. As a term,
“data science” often is applied more broadly than the traditional use of “data mining,” but data mining techniques
provide some of the clearest illustrations of the principles of data science.
Churn: customers switching from one company to another company. It is expensive all around; one company must
spend on incentives to attract a customer while another company loses revenue when the customer departs.
- Customer retention has been a major use of data mining technologies – especially in telecommunications
and finance businesses. These were some of the earliest and widest adopters of data
mining technologies.
Data science, Engineering and Data-driven Decision making
Data science involves principles, processes and techniques for
understanding phenomena via the automated analysis of data.
Data-driven decision-making (DDD): refers to the practice of basing
decisions on the analysis of data, rather than purely on intuition. For
example, a marketer could select advertisements based purely on her long
experience in the field and her eye for what will work. Or, she could base
her selection on the analysis of data regarding how consumers react to
different ads. She could also use a combination of these approaches. DDD
is not an all-or-nothing practice, and different firms engage in DDD to
greater or lesser degrees.
Benefits of DDD
- The more data-driven a firm is, the more productive it is
- DDD is correlated with higher return on assets, return on equity,
asset utilization and market value.
Two different types of decisions when/why using DDD
1. Decisions for which ‘discoveries’ need to be made within data (the Target case: they were interested in
whether they could predict that people are expecting a baby)
2. Decisions that repeat, especially at massive scale, so that decision-making can benefit from even small
increases in decision-making accuracy based on data analysis (the MegaTelCo churn case: by improving
their ability to estimate, for a given customer, how profitable it would be for them to focus on her, they
can potentially reap large benefits by applying this ability to the millions of customers in the population)
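The second type of decision can be made concrete with a back-of-the-envelope calculation. All numbers below are hypothetical, and the model is deliberately simple: it assumes every additional correct decision is worth the same fixed amount.

```python
# All numbers below are hypothetical and chosen only to show the arithmetic.
customers = 2_000_000            # number of repeated decisions (one per customer)
accuracy_before_pct = 70         # share of correct decisions before, in percent
accuracy_after_pct = 71          # share after a small modelling improvement
value_per_correct_decision = 10  # assumed gain per additional correct decision


def extra_value(n, before_pct, after_pct, value):
    """Gain from the additional correct decisions across n repeated decisions."""
    extra_correct = n * (after_pct - before_pct) // 100
    return extra_correct * value


gain = extra_value(
    customers, accuracy_before_pct, accuracy_after_pct, value_per_correct_decision
)
print(gain)
```

Under these assumptions, a one-percentage-point improvement yields 20,000 additional correct decisions and a gain of 200,000, which is why small accuracy gains matter at massive scale.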
Predictive model: abstracts away most of the complexity of the world, focusing on a particular set of indicators
that correlate in some way with a quantity of interest (who will churn, who will purchase).
Automated DDD: different industries have adopted automatic decision-making at different rates. The finance and
telecommunications industries were early adopters, largely because of their precocious development of data
networks and implementation of massive scale computing, which allowed the aggregation and modeling of data
at a large scale, as well as the application of the resultant models to decision-making.
Data Processing and “Big Data”
To understand data science and data-driven businesses it is important to understand how data science differs from data processing. Data science
needs access to data and it often benefits from sophisticated data engineering that data processing technologies
may facilitate, but these technologies are not data science technologies per se. They support data science, but
they are useful for much more. Data processing technologies are very important for many data-oriented business
tasks that do not involve extracting knowledge or data-driven decision-making, such as efficient transaction
processing, modern web system processing, and online advertising campaign management.
“Big data” technologies: big data essentially means datasets that are too large for traditional data processing
systems, and therefore require new processing technologies. Big data technologies are used for many tasks,
including data engineering. Occasionally, big data technologies are actually used for implementing data mining
techniques. However, much more often the well-known big data technologies are used for data processing in
support of the data mining techniques and other data science activities.
From Big Data 1.0 to Big Data 2.0
The shift can be understood by analogy with the adoption of Web technologies. Once firms had incorporated Web 1.0
technologies thoroughly (and in the process had driven down prices of the
underlying technology) they started to look further. They began to ask what the Web could do for them, and how
it could improve things they’d always done, and we entered the era of Web 2.0, where new systems and
companies began taking advantage of the interactive nature of the Web. The changes brought on by this shift in
thinking are pervasive; the most obvious are the incorporation of social-networking components, and the rise of
the ‘voice’ of the individual consumer (and citizen). A similar move from Big Data 1.0 to Big Data 2.0 can be expected.
Data and Data science capability as a strategic asset.
The prior sections suggest one of the fundamental principles of data science: data, and the capability to extract
useful knowledge from data, should be regarded as key strategic assets. Too many businesses regard data analytics
as pertaining mainly to realizing value from some existing data, and often without careful regard to whether the
business has the appropriate analytical talent. Viewing these as assets allows us to think explicitly about the extent
to which one should invest in them. The best data science team can yield little value without the appropriate data;
the right data often cannot substantially improve decisions without suitable data science talent. As with all assets,
it is often necessary to make investments. Building a top-notch data science team is a nontrivial undertaking, but
can make a huge difference for decision-making
- Amazon was able to gather data on online customers early on, which has created significant switching costs:
consumers find value in the rankings and recommendations that Amazon provides. Amazon therefore
can retain customers more easily and can even charge a premium.
Data analytic thinking
Analyzing case studies such as the churn problem improves our ability to approach problems “data-analytically”.
When faced with a business problem, you should be able to assess whether and how data can improve
performance. Data analysis is now critical to business strategy. Businesses are increasingly driven by data
analytics, so there is great professional advantage in being able to interact competently with and within such
businesses. Understanding the fundamental concepts and having frameworks for organizing data-analytic thinking
not only will allow one to interact competently but will help to envision opportunities for improving data-driven
decision-making, or to see data-oriented competitive threats.
- Businesses can get leverage from a data science team for making better decisions in multiple areas of the
business. However, as McKinsey points out, the managers in those areas need to understand the
fundamentals of data science to effectively get that leverage.