Data science methods
Lecture 1
Part 1 – The basics of modelling
The data science process consists of 7 steps (we focus on the first five).
Criteria for a good model:
1. Simple
2. Evolutionary: start with a basic model and make it more complex with each
iteration.
3. Complete
4. Adaptive: works for the specific conditions you want to investigate.
5. Robust: useful in different situations and should hold in the future.
Part 2 – Machine learning
What is machine learning?
Herbert Alexander Simon:
“Learning is any process by which a system improves performance from
experience. Machine learning is concerned with computer programs that
automatically improve their performance through experience.”
IBM: machine learning is a branch of artificial intelligence and computer science
which focuses on the use of data and algorithms to imitate the way that humans
learn, gradually improving its accuracy.
In other words, learning is a stepwise process of improvement.
Machine learning can be categorized into 3 types of models:
1. Supervised: Uses a training set, including both inputs and correct (i.e.
labeled) outputs, to teach models to yield the desired output. The more
images of apples you show the algorithm, the better it learns to recognize
apples. The machine can then determine whether something is an apple or not.
2. Unsupervised: Identifies patterns in data sets containing data points that
are neither classified nor labeled. Give an algorithm a lot of data (e.g.
images) and the algorithm starts to find out which images are similar to
each other and which are less similar to each other.
3. Reinforcement: Gives the model feedback or corrections so that it learns
how to make decisions. The algorithm makes a prediction (e.g. this is an
apple) and you tell it whether that is correct. If it is incorrect, you give
a penalty, so it knows it made a mistake.
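The unsupervised case can be illustrated with a very small grouping routine. This is a minimal sketch, not a method from the lecture: the function name two_means, the starting centres and the "brightness" scores are all assumptions made for illustration. It shows how similar items end up in the same group without any labels being provided.

```python
def two_means(values, iterations=10):
    """Split 1-D values into two groups without using any labels."""
    # Assumption: start the two group centres at the smallest and largest value.
    centres = [min(values), max(values)]
    groups = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to the nearest centre.
        groups = [0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
                  for v in values]
        # Move each centre to the mean of the values assigned to it.
        for g in (0, 1):
            members = [v for v, label in zip(values, groups) if label == g]
            if members:
                centres[g] = sum(members) / len(members)
    return groups, centres


# Hypothetical image "brightness" scores; note that no labels are given.
scores = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
groups, centres = two_means(scores)
print(groups)  # [0, 0, 0, 1, 1, 1]
```

The algorithm only discovers that there are two groups of similar items; it has no idea what the groups mean.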
How supervised machine learning works:
- Step 1: provide the machine learning algorithm with categorized or ‘labeled’
input and output data from which to learn
- Step 2: feed the machine new, unlabeled information to see if it tags the new
data appropriately; if not, continue refining the algorithm
Types of problems to which it is suited: classification (sorting items into
categories) and regression (predicting real values such as dollars, weight, etc.)
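The two steps above can be sketched for a classification problem. This is a minimal sketch under stated assumptions: the nearest-centroid rule is just one simple choice of algorithm, and the feature names (redness, roundness), the function names and the data are hypothetical.

```python
def train_centroids(examples):
    """Step 1: learn one average feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, features):
    """Step 2: tag a new item with the label of the closest centroid."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical labeled data: (redness, roundness) -> label.
labeled = [
    ((0.90, 0.90), "apple"), ((0.80, 0.95), "apple"), ((0.85, 0.85), "apple"),
    ((0.20, 0.30), "not apple"), ((0.30, 0.20), "not apple"),
    ((0.10, 0.25), "not apple"),
]
centroids = train_centroids(labeled)
print(predict(centroids, (0.70, 0.80)))  # apple
print(predict(centroids, (0.25, 0.40)))  # not apple
```

If new items were tagged incorrectly, step 2 would send us back to refine the algorithm or the training data.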
Machine learning techniques are very ‘hot and happening’:
Why machine learning?
- Develop systems that can automatically adapt and customize themselves
to individual users
Personalized news, recommendation system (e.g. Amazon, Netflix),
e-mail filters
- Discover new knowledge from large databases (data mining)
Market basket analysis (e.g. what products do consumers purchase
together?)
Make predictions (e.g. which customers will churn and how will
revenue develop?)
- Ability to mimic humans and take over certain monotonous tasks that
require some intelligence
Recognizing handwritten characters (e.g. data that combines typed and
handwritten text can be used to train the ML model)
Categorize unstructured data
Automated grading
- Relatively fast and cheap
Statistics, machine learning and artificial intelligence
The distinctions are fuzzy.
Statistics
- Theory-based
- Focused on testing hypotheses
Machine learning
- Based on heuristics
- Focused on improving performance of a learning agent
Artificial intelligence
- Machines performing tasks that are characteristic of human intelligence:
planning, understanding language, recognizing objects/sounds, learning,
problem solving
- Machine learning can be helpful in realizing this
Two key observations:
1. Machine learning is a container concept: it includes many different types
of techniques.
2. Machine learning seems very ‘new’, but consists largely of techniques that
have a long tradition in different disciplines.
Two types of e-mails:
1. Ham. Desired emails you want in your inbox
2. Spam. Emails that you don’t want in your inbox
The machine learning modelling process:
1. Create a training set
2. Train the algorithm/model
3. Score a new dataset with the trained classifier (classification model)
Different steps
Step 1: Create a training set
- Good training set: crucial for any machine learning project
- But creating one is quite a pain: annoying, boring, time-consuming and expensive
- Without a good training set:
Algorithm not properly trained
New data not well classified
- Minimum requirements:
For supervised projects: reliable labeling of outputs
Size: more than 10 x the number of inputs; more for complex (non-linear)
relationships
No self-selection effects (e.g. communication by regular mail/email)
Good representation of all phenomena that can occur
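The size rule of thumb above is easy to check in code. This is a minimal sketch: the function names are hypothetical, and the factor of 10 is the lower bound from the notes, so complex (non-linear) relationships would call for a larger factor.

```python
def minimum_training_size(n_inputs, factor=10):
    """Rule-of-thumb lower bound: factor x number of inputs."""
    return factor * n_inputs


def is_large_enough(n_examples, n_inputs, factor=10):
    """True if the training set is strictly larger than the lower bound."""
    return n_examples > minimum_training_size(n_inputs, factor)


# Assumed example: a model with 6 inputs needs more than 60 examples.
print(minimum_training_size(6))   # 60
print(is_large_enough(50, 6))     # False
print(is_large_enough(500, 6))    # True
```

Passing this check says nothing about the other requirements (reliable labels, no self-selection, good representation), which have to be judged separately.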
Three one-liners about training sets
- Garbage in, garbage out. If you have a bad data set, the predictions of the
model will be bad. The data you put in has to be good
- Having better data often beats better algorithms
- Data silos are the enemy
Step 2: train the algorithm/model
- Goal of the algorithm: obtaining relevant insights from the training set
Find systematic patterns in/relations between variables
- Examples
How can we use features of customers to predict fraud?
Can we divide customers of InterAmerican into segments?
- Over the years, many algorithms/models have been developed
Illustration: learning to filter spam
- Spam is all email that the user does not want to receive and has not asked
to receive
- Objective: identify spam emails
- Data (step 1): a database of emails
Email type (‘spam’ or ‘ham’ – classified by users)
Number of recipients
Email length
Country (based on IP)
Customer type
Wording
Images (+host)
The algorithm tries to distinguish between ham and spam. E.g.:
- Classification based on number of recipients (Spam has more recipients
compared to ham).
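Classification on a single input can be sketched as a search for the best cut-off value. This is a minimal sketch: the function name and the recipient counts are hypothetical, and "best" here simply means the fewest errors on the training set.

```python
def best_threshold(recipient_counts, labels):
    """Return the recipient threshold with the fewest training errors."""
    best, best_errors = None, len(labels) + 1
    for threshold in sorted(set(recipient_counts)):
        # Rule: more recipients than the threshold -> 'spam', otherwise 'ham'.
        errors = sum(
            ("spam" if n > threshold else "ham") != label
            for n, label in zip(recipient_counts, labels)
        )
        if errors < best_errors:
            best, best_errors = threshold, errors
    return best, best_errors


# Hypothetical training set: number of recipients per email and its label.
recipients = [1, 2, 1, 3, 25, 40, 18, 60]
labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
threshold, errors = best_threshold(recipients, labels)
print(threshold, errors)  # 3 0
```

With this toy data, every email with more than 3 recipients is spam, so the rule makes no training errors.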
Step 2: Learning from the training set – multiple inputs
- Classification of ‘spam’ and ‘ham’ can also be based on multiple inputs!
- E.g. the number of recipients and email length (number of inputs = 2).
- Classification can again be based on a line: classify all emails north-west
of it as ‘ham’ and all emails south-east of it as ‘spam’. Note that this is
not a regression line!
In this case, any of these lines classifies the data perfectly
Training the model means: finding the best line
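Finding a separating line from two inputs can be sketched with the perceptron learning rule. This is a swapped-in technique for illustration, not necessarily the one used in the lecture: it finds a line that separates the training data, not a guaranteed "best" one. The axes (x = number of recipients, y = email length, both on small made-up scales), the function names and the data are assumptions.

```python
def train_perceptron(points, labels, epochs=100):
    """Find weights w and bias b so that w.x + b > 0 means 'spam'."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if predicted != y:
                # Nudge the line towards classifying this email correctly.
                w[0] += y * x1
                w[1] += y * x2
                b += y
                errors += 1
        if errors == 0:  # every training email is on the correct side
            break
    return w, b


def classify(email, w, b):
    return "spam" if w[0] * email[0] + w[1] * email[1] + b > 0 else "ham"


# Hypothetical training set: (number of recipients, email length).
points = [(8, 2), (1, 6), (9, 3), (2, 8), (7, 1), (1, 9), (10, 4), (3, 7)]
labels = [1, -1, 1, -1, 1, -1, 1, -1]  # +1 = spam, -1 = ham
w, b = train_perceptron(points, labels)
print(w, b)  # [7.0, -4.0] 0.0
```

The resulting line 7 x recipients - 4 x length = 0 puts ham (few recipients, long emails) to the north-west and spam to the south-east.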
Step 3: Classifying new data
- Once the model is trained in step 2 (i.e. when the best line is found), we
can use it to classify new data, which has unknown output
- We first place the new email in the space
- Subsequently, we classify it, based on the trained model, according to the
subspace in which it resides
- In this case, we should classify the new email as ‘spam’
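Scoring a new email then amounts to checking on which side of the line it falls. This is a minimal sketch: the line (weights and bias) and the new email are hypothetical values standing in for a model trained in step 2.

```python
def classify(email, weights, bias):
    """Return 'spam' if the email lies on the spam side of the line."""
    recipients, length = email
    score = weights[0] * recipients + weights[1] * length + bias
    return "spam" if score > 0 else "ham"


# Hypothetical trained line: 7 x recipients - 4 x length = 0.
weights, bias = (7.0, -4.0), 0.0

# New email with unknown output: many recipients, short text.
new_email = (9, 2)
result = classify(new_email, weights, bias)
print(result)  # spam
```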
Assessing the machine learning process
- So far: ideal situation – the 2 types of emails are perfectly separable by a
simple line
- What if, as on the next slide, a straight line leads to 2 errors?
- To reduce the number of errors to zero, we would need a more complicated
model
- Which model is likely a better representation of the underlying
phenomenon?
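One way to explore this question is to compare both kinds of model on emails that were not used for training. This is a minimal sketch under strong assumptions: the data are constructed for illustration, with two training emails deliberately given "noisy" labels; the simple model is a fixed straight line and the complicated model is a 1-nearest-neighbour rule that memorizes the training set. It does not settle the question in general.

```python
def line_model(email):
    """Simple model: a fixed straight line (assumed to be already trained)."""
    recipients, length = email
    return "spam" if recipients - length > 0 else "ham"


def make_memorizer(training_set):
    """Complicated model: 1-nearest-neighbour, memorizes every training email."""
    def model(email):
        def squared_distance(example):
            (r, l), _ = example
            return (r - email[0]) ** 2 + (l - email[1]) ** 2
        return min(training_set, key=squared_distance)[1]
    return model


def count_errors(model, dataset):
    return sum(model(email) != label for email, label in dataset)


# Hypothetical data: (number of recipients, email length) -> label.
train = [
    ((8, 2), "spam"), ((9, 3), "spam"), ((7, 1), "spam"), ((10, 4), "spam"),
    ((1, 6), "ham"), ((2, 8), "ham"), ((1, 9), "ham"), ((3, 7), "ham"),
    ((2, 7), "spam"),  # noisy label: looks like ham
    ((8, 3), "ham"),   # noisy label: looks like spam
]
test = [((2.0, 6.8), "ham"), ((8.2, 3.1), "spam"),
        ((9, 2), "spam"), ((1, 8), "ham")]

memorizer = make_memorizer(train)
print(count_errors(line_model, train), count_errors(memorizer, train))  # 2 0
print(count_errors(line_model, test), count_errors(memorizer, test))    # 0 2
```

In this constructed case the complicated model has zero training errors because it also memorizes the two noisy emails, and it then misclassifies new emails that fall close to them.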