All lectures notes - DSM

Data science lecture notes and R code examples for assignments

  • January 14, 2022
  • 51
  • 2021/2022
  • Class notes
  • Evert de Haan, Alec Minnema, Trynsje
  • All classes

Data science methods
Lecture 1
Part 1 – The basics of modelling
The data science process consists of 7 steps (we focus on the first five):




Criteria for a good model:
1. Simple
2. Evolutionary: start with a basic model and make it more complex with each
iteration.
3. Complete
4. Adaptive: works for the specific conditions you want to investigate.
5. Robust: useful in different situations and should hold in the future.




Part 2 – Machine learning
What is machine learning?
Herbert Alexander Simon:
“Learning is any process by which a system improves performance from
experience. Machine learning is concerned with computer programs that
automatically improve their performance through experience.”
IBM: machine learning is a branch of artificial intelligence and computer science
which focuses on the use of data and algorithms to imitate the way that humans
learn, gradually improving its accuracy.
→ A stepwise process of improvement.

Machine learning can be categorized into 3 types of models:

1. Supervised: Uses a training set, including both inputs and correct (i.e.
labeled) outputs, to teach models to yield the desired output. The more images
of apples you show the algorithm, the better it learns to recognize apples.
The machine can then determine whether something is an apple or not.
2. Unsupervised: Identifies patterns in data sets containing data points that
are neither classified nor labeled. Give an algorithm a lot of data (e.g.
images) and it starts to find out which images are similar to each other and
which are less similar.
3. Reinforcement: Gives models feedback or corrections so that they learn how
to make decisions. The algorithm makes a prediction (e.g. this is an apple) and
you tell it whether that is correct. If it is incorrect, you give a penalty, so
it knows it made a mistake.

How supervised machine learning works:
- Step 1: provide the machine learning algorithm with categorized or ‘labeled’
input and output data to learn from
- Step 2: feed the machine new, unlabeled information to see if it tags the new
data appropriately; if not, continue refining the algorithm
Types of problems to which it is suited: classification (sorting items into
categories) and regression (identifying real values like dollars, weight, etc.)
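
The two steps above can be sketched in a few lines of Python. This is only an illustration, not the method used in the lectures: the nearest-centroid classifier, the fruit measurements and the function names are all assumptions made for the example.

```python
# Step 1: labeled training data.
# Hypothetical measurements: (weight in grams, length in cm), label.
training_set = [
    ((150.0, 7.0), "apple"),
    ((170.0, 7.5), "apple"),
    ((160.0, 7.2), "apple"),
    ((120.0, 18.0), "banana"),
    ((130.0, 19.0), "banana"),
    ((110.0, 17.5), "banana"),
]

def train(examples):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: dist(centroids[label]))

model = train(training_set)

# Step 2: feed the model new items. The true labels are kept aside and used
# only to check whether the model tags the new data appropriately.
new_items = [((155.0, 7.1), "apple"), ((125.0, 18.5), "banana")]
correct = sum(predict(model, x) == y for x, y in new_items)
accuracy = correct / len(new_items)
print(f"accuracy on new data: {accuracy:.2f}")
```

If the accuracy on new data is too low, that is the signal to keep refining the algorithm or to improve the training set.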

Machine learning techniques are very ‘hot and happening’:




Why machine learning?
- Develop systems that can automatically adapt and customize themselves
to individual users
  → Personalized news, recommendation systems (e.g. Amazon, Netflix),
  e-mail filters
- Discover new knowledge from large databases (data mining)
  → Market basket analysis (e.g. which products do consumers purchase
  together?)
  → Make predictions (e.g. which customers will churn and how will
  revenue develop?)
- Ability to mimic humans and replace certain monotonous tasks that
require some intelligence
  → Recognizing handwritten characters (e.g. data that is partly typed and
  partly handwritten can help train the model)
  → Categorizing unstructured data
  → Automated grading
- Relatively fast and cheap
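
As a rough illustration of the market basket idea mentioned in the list above, the sketch below counts how often two products appear in the same basket. The baskets and product names are made up, and counting pairs is only the simplest possible version of the technique.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort first so each pair is always counted under the same (a, b) ordering.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = share of all baskets that contain both products.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
top_pair, top_support = max(support.items(), key=lambda kv: kv[1])
print(top_pair, top_support)
```

In this toy data, bread and butter are bought together in 3 of the 4 baskets, so that pair has the highest support (0.75).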

Statistics, machine learning and artificial intelligence
The distinctions between the three are fuzzy.
Statistics
- Theory-based
- Focused on testing hypotheses
Machine learning
- Based on heuristics
- Focused on improving performance of a learning agent
Artificial intelligence
- Machines performing tasks that are characteristic of human intelligence:
planning, understanding language, recognizing objects/sounds, learning,
problem solving
- Machine learning can be helpful in realizing this

Two key observations:
1. Machine learning is a container concept: it includes many different types of
techniques.
2. Machine learning seems very ‘new’, but it consists largely of techniques
that have a long tradition in different disciplines.

Two types of e-mails:
1. Ham: emails you want in your inbox
2. Spam: emails you don’t want in your inbox

The machine learning modelling process:
1. Create a training set
2. Train the algorithm/model
3. Score a new dataset with the trained classifier (classification model)
Different steps
Step 1: Create a training set
- A good training set is crucial for any machine learning project
- But creating one is quite a pain: annoying, boring, time-consuming and expensive
- Without a good training set:
  → The algorithm is not properly trained
  → New data is not well classified
- Minimum requirements:
  → For supervised projects: reliable labeling of outputs
  → Size: > 10 x number of inputs, and more for complex (non-linear)
  relationships
  → No self-selection effects (e.g. communication by regular mail/email)
  → Good representation of all phenomena that can occur
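
The size requirement above is simple arithmetic, sketched below. The function name and the example of 6 inputs are assumptions for illustration. The rule is only a lower bound, since the notes ask for more data when relationships are complex.

```python
def meets_size_rule(n_rows: int, n_inputs: int, factor: int = 10) -> bool:
    """Rule of thumb from the notes: training set size > 10 x number of inputs."""
    return n_rows > factor * n_inputs

# Hypothetical case: a model with 6 inputs needs more than 60 labeled rows.
print(meets_size_rule(60, 6))   # exactly 10 x 6 is not enough
print(meets_size_rule(500, 6))
```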
Three one-liners about training sets
- Garbage in, garbage out: if the data set is bad, the model’s predictions
will be bad. The data that goes in has to be good.
- Having better data often beats having better algorithms
- Data silos are the enemy

Step 2: Train the algorithm/model
- Goal of the algorithm: obtaining relevant insights from the training set
  → Find systematic patterns in/relations between variables
- Examples
  → How can we use features of customers to predict fraud?
  → Can we divide customers of InterAmerican into segments?
- Over the years, many algorithms/models have been developed

Illustration: learning to filter spam
- Spam is all email the user doesn’t want to receive and has not asked to
receive
- Objective: identify spam emails
- Data (step 1): a database of emails
  → Email type (‘spam’ or ‘ham’, classified by users)
  → Number of recipients
  → Email length
  → Country (based on IP)
  → Customer type
  → Wording
  → Images (+host)
The algorithm tries to distinguish between ham and spam. E.g.:
- Classification based on the number of recipients (spam has more recipients
than ham).
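
Classification on a single input can be sketched as learning a cut-off value. The emails below are invented, and the search over midpoints between neighbouring values is one simple way to pick a cut-off, not necessarily the approach shown in the lecture.

```python
# Hypothetical training set: (number of recipients, label assigned by users).
emails = [(1, "ham"), (2, "ham"), (3, "ham"), (5, "ham"),
          (40, "spam"), (75, "spam"), (120, "spam"), (300, "spam")]

def train_threshold(examples):
    """Try a cut-off between each pair of neighbouring values and keep the
    one that misclassifies the fewest training emails."""
    values = sorted({n for n, _ in examples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cut):
        # An email counts as an error when "above the cut" disagrees with "is spam".
        return sum((n > cut) != (label == "spam") for n, label in examples)

    return min(candidates, key=errors)

def classify(n_recipients, cut):
    """Emails with more recipients than the cut-off are flagged as spam."""
    return "spam" if n_recipients > cut else "ham"

cut = train_threshold(emails)
print(cut, classify(50, cut), classify(2, cut))
```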

Step 2: Learning from the training set – multiple inputs
- Classification of ‘spam’ and ‘ham’ can also be based on multiple inputs!
- E.g. the number of recipients and email length (number of inputs = 2).
- Classification can again be based on a line: classify all emails north-west of
it as ‘ham’ and all emails south-east of it as ‘spam’. Note that this is
not a regression line!
  → In this case, any of these lines classifies the data perfectly
  → Training the model means: finding the best line
Step 3: Classifying new data
- Once the model is trained in step 2 (i.e. when the best line is found), we
can use it to classify new data, which has unknown output
- We first place the new email in the space
- Subsequently, we classify it, based on the trained model, according to the
subspace in which it resides
- In this case, we should classify the new email as ‘spam’
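
Classifying a new email then comes down to checking on which side of the trained line it falls. In the sketch below the line’s coefficients and the new email are hypothetical values chosen for illustration.

```python
# Hypothetical trained line: 7 * recipients - 6 * length = 0.
# Length is measured in hundreds of words (an assumption for this example).
W = (7.0, -6.0)
B = 0.0

def classify(recipients, length):
    """Place the email in the 2-D space and report the subspace it falls in."""
    side = W[0] * recipients + W[1] * length + B
    return "spam" if side > 0 else "ham"

new_email = (9, 2)   # many recipients, short text: south-east of the line
print(classify(*new_email))
```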

Assessing the machine learning process
- So far: ideal situation, in which the 2 types of emails are perfectly
separable by a simple line
- What if we had the case on the next slide, where a straight line leads to
2 errors?
- To reduce the number of errors to zero, we would have to use a more
complicated model
- Which model is likely a better representation of the underlying
phenomenon?
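
The trade-off behind this question can be illustrated by comparing a simple and a complex model on emails they have not seen. Everything below is an assumption for the sake of the example: the toy data are constructed so that one ham and one spam email are atypical, the simple model is a nearest-centroid rule (straight-line boundary), and the complex model is 1-nearest-neighbour (it memorises the training set).

```python
# Toy data: (recipients, length in hundreds of words), label.
# The last email in each group is atypical: it sits among the other type.
train = [
    ((1, 8), "ham"), ((2, 6), "ham"), ((3, 9), "ham"), ((2, 7), "ham"), ((8, 3), "ham"),
    ((8, 2), "spam"), ((9, 3), "spam"), ((7, 1), "spam"), ((10, 4), "spam"), ((2, 8), "spam"),
]
holdout = [((1, 7), "ham"), ((2.5, 8), "ham"), ((9, 2), "spam"), ((8, 3.5), "spam")]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def simple_model(x):
    """Nearest class centroid: the boundary between the classes is a straight line."""
    centroids = {}
    for label in ("ham", "spam"):
        pts = [p for p, l in train if l == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return min(centroids, key=lambda l: sq_dist(centroids[l], x))

def complex_model(x):
    """1-nearest neighbour: copies the label of the closest training email."""
    return min(train, key=lambda ex: sq_dist(ex[0], x))[1]

def errors(model, data):
    return sum(model(x) != y for x, y in data)

for name, model in [("simple", simple_model), ("complex", complex_model)]:
    print(name, "training errors:", errors(model, train),
          "hold-out errors:", errors(model, holdout))
```

On this toy data the simple model makes 2 training errors but none on the hold-out emails, while the complex model makes no training errors and 2 hold-out errors. That is a constructed outcome, but it shows why zero training errors alone does not make a model the better representation.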
