Data Mining for Business and Governance
Introduction to Data Science (Week 1, Video 1)
What is Data Science?
= A “concept to unify statistics, data analysis and their related methods” in order to “understand and
analyze actual phenomena” with data.
What makes a Data Scientist?
= Data scientists use their data and analytical abilities to find and interpret rich data sources; manage large
amounts of data (…); create visualizations to aid in understanding the data; build mathematical models
using the data; and present and communicate the data insights/findings.
Many Related Fields
Artificial Intelligence: focuses on intelligent behavior (mimicking/copying humans); often more interested in
doing better than humans.
Machine Learning: focuses on learning objectives we want to achieve by programming different
functions and algorithms.
Data Mining: works with many kinds of input data (e.g. VR/sensory data, medical data).
Information Retrieval: what happens when you type something into Google; users doing a search online
(e.g. Siri).
Natural Language Processing: deals with interpreting language; some systems capture information better
than we humans do.
Computer Vision: focuses on the vision system we humans have: processing images, extracting
information from them, classifying objects within images.
Audio Signal Processing: deals with audio, speech and music.
Cognitive Science: deals specifically with the brain and its processes (a very broad field).
Intelligent Games: where the agents in a game behave intelligently (e.g. via machine learning).
Agents (Biology): simulating agents (entities, animals) in games/environments; can we make sense of their
behavior?
→ Know the minor differences!
ONE COMMONALITY: DATA-DRIVEN SCIENCE (ML/DM)
What is Data?
Example (clouds/weather): what about the weather counts as data? (A table with certain attributes.)
What you can see is the outlook. What is the temperature? Is it windy? Can I play outside? YES/NO
• We want to use this data to predict something, i.e. make a classification.
• Sunny or not: this kid doesn’t play outside when it’s sunny.
• A windy condition is not really a good indicator for playing either.
• There is not enough data to make a good interpretation.
Convert the data into features
• Convert outlook into numbers.
o 1 = Sunny
o 0 = Cloudy
o 2 = Rainy
• Do the same for wind and temperature, so there is a mapping for each value (a small sketch follows after this list).
o 1 = yes
o 0 = no
o A binary representation; these encoded columns are called attributes.
• Features are the same as attributes!
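A minimal Python sketch of the encoding mentioned above. The dictionary names and the example observation are illustrative choices, and the temperature mapping is an assumption (the notes don’t spell it out):

outlook_map = {"sunny": 1, "cloudy": 0, "rainy": 2}
windy_map = {"yes": 1, "no": 0}
temp_map = {"hot": 2, "mild": 1, "cold": 0}  # assumed mapping, not given in the lecture

raw = {"outlook": "sunny", "temp": "hot", "windy": "no"}  # one made-up observation
encoded = {
    "outlook": outlook_map[raw["outlook"]],
    "temp": temp_map[raw["temp"]],
    "windy": windy_map[raw["windy"]],
}
print(encoded)  # {'outlook': 1, 'temp': 2, 'windy': 0}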
Other measurements could be the temperature in degrees, how the temperature feels, the amount of rainfall and
the probability of thunder.
• The scale is not fixed across measurements: degrees, percentages, km/h, etc. (the units differ).
Another measurement could be image data, or a combination with other data sources (e.g. a photo of
clouds over a map).
• We might combine it with other information, such as ticket sales for a theme park.
This is INTERPRETING DATA!
Back to our data (of playing outside)
Can you come up with some rules for playing outside?
Conditions (Rules for prediction):
• If it’s sunny & hot → kid does not want to play outside.
• If it’s windy → kid does not want to play outside.
• Then one rule is left: if it’s not windy and not hot → kid wants to play.
We want to predict our target PLAY given the features we have available.
FORMALLY!
• We have our data: X (with features: outlook, temp,
windy)
o Features can be continuous and discrete
o Continuous features: are real valued and can
be within some range.
o Discrete features: finite values and usually
associated with some label of category.
• Our data consists of smaller instances; ‘some instance’ is written as: x.
• If we want to specifically point at a particular instance (say our first row), we write: x1.
• We can see our model as a function f that, when given any instance x, gives us
a prediction ŷ.
• The application of the model to some instance in our
data can be written as f(x).
• Our hope is that ŷ is the same as our target: y.
Quick Recap of Example
• Features: X (outlook, temp, windy)
• Targets: Y (play)
• Some instance: x
• Some target: y
• First instance (row): x1 (sunny, hot, no)
• First target: y1 (no)
• Model: if it’s not windy and not hot → play (f)
• Predictions by f(x): ŷ
• Prediction for f(x1): ŷ1 (no)
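To make the recap concrete, a tiny Python sketch (only the first row matches the lecture table; the other rows and the variable names are made up for illustration; f itself is the play predictor defined below):

X = [("sunny", "hot", "no"), ("rainy", "mild", "yes"), ("cloudy", "cold", "no")]  # features: (outlook, temp, windy)
y = ["no", "no", "yes"]   # targets: play?
x1, y1 = X[0], y[0]       # first instance and its target
# A model f maps an instance x to a prediction ŷ, and we hope that ŷ equals y.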
Predictive Model (OR ALGORITHM)
def play_predictor(data):
    # Rule from the table: play only when it is not windy and not hot.
    if data["windy"] == "no" and data["temp"] != "hot":
        return "play"
    else:
        return "no play"
It’s sunny, mild, and windy…should I play? Realistic?
It will return ‘no play’: the rule only returns ‘play’ when it is NOT windy and NOT hot, and here it is windy (mild or not).
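A quick check of that case with the play_predictor above (the instance dictionary is just an illustration):

instance = {"outlook": "sunny", "temp": "mild", "windy": "yes"}
print(play_predictor(instance))  # 'no play': the windy check already fails, mild or not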
How do we know if our model performs well?
• Correct evaluation is incredibly important in Data Mining.
• We came up with some rules, but how do we know they generalize, i.e. whether the rules we learned apply with
the same success rate to data where we don’t know what the target is.
Results of our model
• 5/6 correct, so our model has 83.3% accuracy (a small computation sketch follows after this list).
• Did we cover all conditions?
• What if we are presented with new conditions?
• Rules are probably too strict.
• Besides the training data from which we determined our rules, we also need test data (unseen by us) to
evaluate on.
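A minimal sketch of how such an accuracy score can be computed; the labels below are made up purely to reproduce the 5/6 figure:

y_true = ["no", "no", "yes", "yes", "no", "yes"]   # actual targets (made up)
y_pred = ["no", "no", "yes", "yes", "yes", "yes"]  # our rules get one wrong
correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))  # 0.833..., i.e. 83.3% accuracy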
Explanation of unseen data: REALISTIC USE CASE
PREDICTING HOUSING PRICES (great example of data mining)
• Would you be able to determine the price of a house? → You need expert knowledge.
• Many observations are required to gain experience (a mental representation of what, for example, makes a
house price higher).
• Features to predict the price of a house?
o Number of bedrooms
o Big garden
o Good neighborhood
HOW TO EVALUATE?
• Previously we had a clear binary (yes/no) prediction.
• If we had more classes, we would still be predicting a nominal target (where the order of the classes does
not matter); this is different from a numeric target, where you really want to predict a price (range).
• For a numeric target we can’t say: we got … out of … correct, and therefore we can’t use accuracy.
• Instead, we are interested in how far our prediction was off from the actual value: this is the error (a small sketch follows below).
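One common way to measure that error for a numeric target is the mean absolute error; a minimal sketch with made-up house prices:

actual = [250_000, 310_000, 180_000]     # made-up sale prices
predicted = [240_000, 330_000, 200_000]  # made-up model outputs
errors = [abs(a - p) for a, p in zip(actual, predicted)]
print(sum(errors) / len(errors))  # mean absolute error, ~16,667 on these toy numbers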
TYPES OF PREDICTION
• Classes → classification (e.g. binary)
• Values → regression
Complex information
• How would location affect price?
• How would pollution affect price?
• How about a good location but high pollution?
• Do you know how much of either would affect the price?
• Would one be able to easily craft a successful ruleset?
LEARNING TO PREDICT (some problems are very hard to solve for humans)
• Hand-made rules are not flexible
• Given more instances/observations, the patterns become more complex, thus requiring better (more
complex) rules.
• With too much data it becomes impossible to analyze manually.
• If done automatically, little expert knowledge is required; mostly data.
• Models can give information regarding underlying patterns and feature importances (see the sketch after this list).
o If many rules mention location as the first condition to look at, that must be an important
feature.
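A small sketch of how a learned model can expose feature importances, assuming scikit-learn is available; the toy data (location score, pollution level, bedrooms) and labels are entirely made up:

from sklearn.tree import DecisionTreeClassifier

# Toy encoded data: columns are [location_score, pollution_level, bedrooms].
X = [[8, 2, 3], [3, 7, 2], [9, 1, 4], [2, 8, 2], [7, 3, 3], [4, 6, 1]]
y = ["expensive", "cheap", "expensive", "cheap", "expensive", "cheap"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.feature_importances_)  # higher value = feature used more by the learned rules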
You need good intuitions and domain expertise, and you need to get to know your data well (it is not just running a
bunch of algorithms and being “done”).
EXTRA MATERIAL
Quick discussion of:
• PC hardware and relation to data and algorithms
• Programming languages and their relation to the above
This is not computer science, so why do I need to know this?
• Algorithm choices often depend on hardware limitations.
• Some model families specifically deal with shortage of computation power.
• Different data types often relate to storage and processing.
• Certain terms are widespread throughout this course.
PC HARDWARE
Power supply (left corner)
CPU (processor)
HDD (storage, disks)
RAM (memory)
Motherboard (connects all the components)
HARD DRIVE (HDD/SSD)
Place where all your stuff is stored.
• Stores your files.
• HDDs are larger (store more data, 1-5 TB) but slower (in reading/writing) and more fragile.
• SSDs are smaller (up to 1 TB), faster and more robust, but expensive.
• Most modern laptops come with an SSD.
• For computations, algorithms/models read a particular set of data from your disks into memory.
MEMORY CHIP (RAM)
• Very fast reading/writing, but even more limited in space (8-16 GB, up to 256 GB) and very expensive.
• Algorithms can quickly access and manipulate data that is in memory.
• If memory limit is exceeded, computers usually freeze/processes slow down.
• Computations done on data in memory are commonly handled by the CPU.
PROCESSOR (CPU)
• Does the computational work of a computer.
• Can have multiple computation cores (dual core, quad core) to run operations in parallel (i.e.
simultaneously), which speeds up processing.
• The more expensive the CPU, the faster it does similar computations. The more cores, the faster it
runs parallel computations.
GRAPHICS CARD (GPU, special CPU)
• Some computations can be done on a GPU rather than the CPU.
• Commonly used for processing images or other visual content. Popular for video games.
• For ordinary systems, the GPU is usually embedded in the CPU.
• GPUs are very fast at ‘matrix operations’ and have therefore been popularized for Deep Learning
research (explained in future lectures).
• Has its own RAM (and therefore its own memory limitations).
Programming Languages
Python (this course) is high-level: it almost reads like English. C++ or Prolog are lower-level (harder to read).
Representing Data (Week 1, Video 2)
Practical Lecture
LAST WEEK
• Data
• Features
• Algorithms
THIS WEEK
• Data
How do we get Data?
• Pre-mades: data sources that have already been compiled by people, with a clear prediction task; ideal to
work with as a starting data scientist.
o Kaggle, UCI, Snap
• Dumps: big dumps of data
o IMDB, Reddit, MovieLens
• Scientific repositories: are always attached to some paper/research
o Dataverse
• (Web) APIs: common interfaces
o Twitter, Reddit
• Web scraping: where you collect the data yourself and select the fields you need. This becomes messy if you
have to use different websites, because every website has its own structure.
• At industry level: (their own) databases.
File Formats (3 main formats)
• CSV (comma-separated values) → flat data structure (table format), with named columns and
rows; uses separators and quotes.
• JSON → hierarchical (nested), with lists and key/values. Widely used for APIs and document-
based databases (NOT FLAT).
• XML (<data>, <movie id>) → hierarchical, with tags to name items. A very common standard; easy to
validate against some predefined structure.
• See the examples from the lecture videos; a small illustration follows below.
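A rough illustration of the three formats, written out from Python with its standard library; the single weather record is made up:

import csv, io, json
import xml.etree.ElementTree as ET

record = {"outlook": "sunny", "temp": "hot", "windy": "no", "play": "no"}

# CSV: flat table, one named column per attribute.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())

# JSON: hierarchical key/value pairs, possibly nested in lists.
print(json.dumps({"observations": [record]}, indent=2))

# XML: hierarchical, with tags naming the items.
root = ET.Element("observations")
ET.SubElement(root, "observation", record)
print(ET.tostring(root, encoding="unicode"))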
What are Databases?
= Collections of hard disks: internet-connected machines that host a bunch of (magnetic) disks that store
a lot of information. Because they are internet-connected, you can access them remotely. You connect
to them via an IP address or a URL such as https://mydatabase.com, which will ask for a username + password.
You can then run queries to get data from the database to you. A DBMS (Database Management System) is the
software that sits between your PC and the database.
Databases are typically split into two types, which differ in how they handle and structure these queries:
Relational databases – Structured Query Language (SQL) databases
• Pre-defined, structured tables, relational → not easy to scale horizontally, but robust and well-
supported (a small query sketch follows below).
• Ex.: MySQL, PostgreSQL, SQLite, MariaDB
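A tiny sketch of querying a relational database from Python with the built-in sqlite3 module; the in-memory database and the weather table are made up for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE weather (outlook TEXT, temp TEXT, windy TEXT, play TEXT)")
conn.execute("INSERT INTO weather VALUES ('sunny', 'hot', 'no', 'no')")

# A structured query (SQL): all non-windy observations.
rows = conn.execute("SELECT outlook, temp, play FROM weather WHERE windy = 'no'").fetchall()
print(rows)  # [('sunny', 'hot', 'no')]
conn.close()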
Non-relational databases – NoSQL