WEEK 1
Why DS? Challenges:
1. Inability to build bridges b/w business and IT
2. Vast amounts of data exist
3. Organisations are increasingly complex due to growing amounts of information and changing environments
Data warehouse → selecting & cleaning → transformation (70% of the time) → data mining →
interpretation & evaluation → knowledge/understanding
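A minimal Python sketch of this pipeline (all function names and the toy data below are made-up placeholders, not from the course):

# Hypothetical sketch of the knowledge-discovery pipeline described above.
# Each step is a placeholder; real projects plug in SQL queries, pandas
# transformations, and a model of choice.

def select_and_clean(raw_rows):
    # Keep only usable records (drop rows with missing values).
    return [r for r in raw_rows if r.get("value") is not None]

def transform(rows):
    # Reshape/normalise the data (this is where most of the effort goes).
    return [{**r, "value": float(r["value"])} for r in rows]

def mine(rows):
    # Stand-in for the actual data-mining / learning step.
    return {"mean_value": sum(r["value"] for r in rows) / len(rows)}

def interpret(patterns):
    # Evaluate the patterns and turn them into knowledge/understanding.
    return f"Average value across the cleaned data: {patterns['mean_value']:.2f}"

raw = [{"value": "3"}, {"value": None}, {"value": "5"}]
print(interpret(mine(transform(select_and_clean(raw)))))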
Machine learning, statistics and data mining
● Statistics:
○ More theory- and model-based
○ More focused on testing hypotheses
● ML
○ More heuristic
○ Focused on improving the performance of a learning agent
○ Also looks at real-time learning and robotics - areas not part of data mining
● Data mining and knowledge discovery
○ Integrates theory and heuristics
○ Focus on the entire process of knowledge discovery, including data cleaning,
learning, and integration and visualization of results
● Fundamental difference b/w ML and statistics: ML is a bottom-up approach and statistics a
top-down approach
○ Statistics builds explanatory models that are not optimized for extending beyond the data
to make predictions, whereas ML builds predictive models aimed at predicting future,
unseen cases
Data warehousing/storage: Coalesce data from across an enterprise, often from multiple
transaction-processing systems
Querying/reporting: Very flexible interface to ask factual questions about data
● No modeling or sophisticated pattern finding
● E.g., SQL, QBE
OLAP (Online Analytical Processing)
● Provides easy-to-use GUI to explore large data collections
● Exploration is manual; no modeling
● Dimensions of analysis pre-programmed into OLAP system
Types of ML
1. Supervised learning
a. Classification
b. Regression
2. Unsupervised learning
3. Reinforcement learning: A mix of the two; the agent learns from a feedback loop of
actions and rewards (see the sketch after this list)
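A minimal sketch contrasting the types, assuming scikit-learn is installed (the dataset and parameters are illustrative only); reinforcement learning is only described in a comment because it needs an environment loop:

# Supervised vs. unsupervised learning on a toy dataset (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier      # supervised: classification
from sklearn.linear_model import LinearRegression    # supervised: regression
from sklearn.cluster import KMeans                   # unsupervised: clustering

X, y = load_iris(return_X_y=True)

# Supervised: the target y (or a numeric target) is given to the learner.
clf = DecisionTreeClassifier().fit(X, y)             # predicts a class label
reg = LinearRegression().fit(X[:, :3], X[:, 3])      # predicts a numeric value

# Unsupervised: no target; the algorithm finds structure (clusters) on its own.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Reinforcement learning (not shown): an agent acts in an environment and
# learns from a loop of actions and rewards rather than from a fixed dataset.
print(clf.predict(X[:1]), reg.predict(X[:1, :3]), clusters[:5])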
Terminology (see the example after this list):
● Columns → attributes or features
● Target variable or target attribute: What you want to predict
● Dimensionality of a dataset is the sum of the dimensions of the features
○ So number of columns (attributes, variables or features)
○ The more dimensions the harder it is to analyse the data
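A small illustration of the terminology, assuming pandas (column names and values are invented):

# Terminology illustrated on a tiny table (pandas assumed; columns invented).
import pandas as pd

df = pd.DataFrame({
    "age": [25, 40, 31],                # attribute / feature
    "salary": [30000, 60000, 45000],    # attribute / feature
    "churned": ["no", "yes", "no"],     # target attribute: what we want to predict
})

features = df.drop(columns=["churned"])
target = df["churned"]

# Dimensionality = number of feature columns; more columns = harder analysis.
print("dimensionality:", features.shape[1])   # -> 2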
Data → categorical or numerical (see the sketch after this list)
● Categorical: nominal (e.g., binomial) or ordinal (ranking in classes)
● Numerical: interval (data where the zero-point is not fixed, e.g., temperature) or ratio
(fixed zero-point, can be divided, e.g., salary, height)
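A sketch of the four scales, again assuming pandas (the example values are invented):

# Categorical vs. numerical scales (pandas assumed; example values invented).
import pandas as pd

df = pd.DataFrame({
    "smoker": ["yes", "no", "no"],        # nominal (here binomial: two classes)
    "size": ["S", "L", "M"],              # ordinal: classes with a ranking
    "temp_celsius": [21.0, 35.5, -4.0],   # interval: zero-point not fixed
    "salary": [30000, 60000, 45000],      # ratio: fixed zero-point, can be divided
})

df["smoker"] = pd.Categorical(df["smoker"])                    # unordered categories
df["size"] = pd.Categorical(df["size"],
                            categories=["S", "M", "L"], ordered=True)

# Ratio data supports division ("twice the salary"); interval data does not
# (35.5 C is not "twice as hot" as 17.75 C in any physical sense).
print(df.dtypes)
print(df["salary"][1] / df["salary"][0])   # -> 2.0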
DM extracts patterns from data
● Some tasks can be done by using either supervised or unsupervised methods (e.g.,
similarity matching, link prediction, data reduction) and algorithms (e.g., artificial neural
networks (ANN))
WEEK 2
Decision trees: Fundamental and important algorithm in data science
Classification goal: Classify new data into existing categories
Classification techniques examples: Statistical analysis, decision tree analysis, support vector
machines, case-based reasoning, neural networks, Bayesian classifiers, genetic algorithms,
rough sets
Classification: Linear regression
● w0 + w1x + w2y ≥ 0
● Regression computes wi from data to minimize squared error to 'fit' the data
● The regression itself does not categorize a point; it only gives how far the point (x, y)
lies from the line, so the sign of w0 + w1x + w2y is used as the decision rule (see the
sketch below)
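A minimal NumPy sketch of this idea (the toy data is invented): fit the weights by least squares against +1/-1 labels, then classify by the sign of w0 + w1x + w2y:

# Linear regression used as a crude classifier (NumPy assumed; toy data invented).
import numpy as np

# Two features (x, y) per point and a class label coded as +1 / -1.
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
labels = np.array([-1, -1, 1, 1])

# Add a constant column so w0 is the intercept, then solve least squares for w.
A = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(A, labels, rcond=None)

# The regression only gives a signed distance-like score from the line;
# the classification rule is the threshold w0 + w1*x + w2*y >= 0.
scores = A @ w
predicted = np.where(scores >= 0, 1, -1)
print(w, predicted)   # predicted matches the labels on this toy data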
Decision tree classification task:
● Training set → tree induction algorithm (induction: learn the model) → model (decision
tree) → apply the model to the test set (deduction); see the sketch after this list
● No loops
● Each child cannot have more than one parent
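A minimal version of this induction/deduction pipeline, assuming scikit-learn (the dataset and split size are illustrative):

# Induction: learn a decision tree from the training set.
# Deduction: apply the learned model to the test set. (scikit-learn assumed.)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # induction (learn model)
predictions = model.predict(X_test)                      # deduction (apply model)
print("test accuracy:", model.score(X_test, y_test))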
Creating decision trees:
● Employs the divide and conquer method
● Recursively divides a training set until each division consists of examples from one
class
1. Create a root node and assign all of the training data to it
2. Select the best splitting attribute
3. Add a branch to the root node for each value of the split. Split the data into mutually
exclusive subsets along the lines of the specific split
4. Repeat steps 2 and 3 for each leaf node until a stopping criterion is reached (see the sketch below)
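A simplified Python sketch of this recursion (the "best" splitting attribute is faked here by taking the first unused attribute, standing in for a real criterion such as information gain; the example rows are invented):

# Simplified recursive (divide-and-conquer) tree induction.
def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                      # pure node: one class left -> leaf
        return classes.pop()
    if not attributes:                         # stopping criterion: nothing left to split on
        return max(classes, key=lambda c: sum(r[target] == c for r in rows))
    attr = attributes[0]                       # step 2: "best" splitting attribute (faked)
    tree = {attr: {}}
    for value in {r[attr] for r in rows}:      # step 3: one branch per attribute value
        subset = [r for r in rows if r[attr] == value]   # mutually exclusive subsets
        tree[attr][value] = build_tree(subset, attributes[1:], target)  # step 4: recurse
    return tree

rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
print(build_tree(rows, ["outlook", "windy"], "play"))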