All lecture notes - DSM

Data science lecture notes and R code examples for assignments

  • 14 January 2022
  • 51 pages
  • 2021/2022
  • Lecture notes
  • Lecturers: Evert de Haan, Alec Minnema, Trynsje
  • All lectures

By: georgie_vw
Data science methods
Lecture 1
Part 1 – The basics of modelling
The data science process consists of 7 steps (we focus on the first five):
[Figure: the 7 steps of the data science process]

Criteria for a good model:
1. Simple
2. Evolutionary → start with a basic model and make it more complex each time.
3. Complete
4. Adaptive → works under the specific conditions you want to investigate.
5. Robust → useful in different situations and should continue to hold in the future.




Part 2 – Machine learning
What is machine learning?
Herbert Alexander Simon:
“Learning is any process by which a system improves performance from
experience. Machine learning is concerned with computer programs that
automatically improve their performance through experience.”
IBM: machine learning is a branch of artificial intelligence and computer science
which focuses on the use of data and algorithms to imitate the way that humans
learn, gradually improving its accuracy.
→ A stepwise process of improvement.

Machine learning can be categorized into 3 types of models:

1. Supervised: Uses a training set, including both inputs and correct (i.e.
labeled) outputs, to teach models to yield the desired output. The more images
of apples you show the algorithm, the better it learns to recognize
apples. The machine can then determine whether something is an apple or not.
2. Unsupervised: Identifies patterns in data sets containing data points that
are neither classified nor labeled. Give an algorithm a lot of data (e.g.
images) and it starts to find out which images are similar to
each other and which are less similar.
3. Reinforcement: Gives the model feedback or corrections so that it
learns how to make decisions. The algorithm makes a prediction (e.g. this is an
apple) and you tell it whether that is correct. If it is incorrect, you give a
penalty, so it knows it made a mistake.

How supervised machine learning works:
- Step 1: provide the machine learning algorithm with categorized or ‘labeled’
input and output data to learn from
- Step 2: feed the machine new, unlabeled information to see if it tags the new
data appropriately; if not, continue refining the algorithm
Types of problems to which it is suited: classification (sorting items into
categories) and regression (predicting real values such as dollars, weight, etc.)
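
The two steps above can be sketched in a few lines of Python. This is only an illustration, not a method from the lectures: the fruit measurements are made up, and the nearest-centroid rule (assign the label whose average training point is closest) is an assumed stand-in for "the algorithm".

```python
from statistics import mean

# Step 1: labeled training data (hypothetical fruit measurements:
# weight in grams, diameter in cm).
training_set = [
    ((150.0, 7.0), "apple"),
    ((170.0, 7.5), "apple"),
    ((160.0, 7.2), "apple"),
    ((10.0, 2.0), "grape"),
    ((12.0, 2.2), "grape"),
    ((8.0, 1.8), "grape"),
]


def train(examples):
    """Learn one centroid (mean feature vector) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to the new point."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


model = train(training_set)

# Step 2: feed the model new, unlabeled data and inspect how it tags them.
new_items = [(155.0, 7.1), (9.0, 2.1)]
predictions = [predict(model, item) for item in new_items]
print(predictions)
```

If the tags on new data look wrong, the refinement step from the notes would mean changing the features, the training set, or the rule itself.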

Machine learning techniques are very ‘hot and happening’.

Why machine learning?
- Develop systems that can automatically adapt and customize themselves
to individual users
→ Personalized news, recommendation systems (e.g. Amazon, Netflix),
e-mail filters
- Discover new knowledge from large databases (data mining)
→ Market basket analysis (e.g. what products do consumers purchase
together?)
→ Make predictions (e.g. which customers will churn and how will
revenue develop?)
- Ability to mimic humans and take over certain monotonous tasks that
require some intelligence
→ Recognizing handwritten characters (e.g. data that is partly typed and
partly handwritten can help train the ML model)
→ Categorizing unstructured data
→ Automated grading
- Relatively fast and cheap

Statistics, machine learning and artificial intelligence
The distinctions are fuzzy.
Statistics
- Theory-based
- Focused on testing hypotheses
Machine learning
- Based on heuristics
- Focused on improving performance of a learning agent
Artificial intelligence
- Machines performing tasks that are characteristic of human intelligence:
planning, understanding language, recognizing objects/sounds, learning,
problem solving
- Machine learning can be helpful in realizing this

Two key observations:
1. Machine learning is an umbrella term (‘container concept’): it covers many
different types of techniques.
2. Machine learning seems very ‘new’, but largely consists of
techniques that have a long tradition in different disciplines

Two types of e-mails:
1. Ham: emails you want in your inbox
2. Spam: emails you don’t want in your inbox

The machine learning modelling process:
1. Create a training set
2. Train the algorithm/model
3. Score a new dataset with the trained classifier/classification model

Different steps
Step 1: Create a training set
- A good training set is crucial for any machine learning project
- But creating one is quite a pain: annoying, boring, time-consuming and expensive
- Without a good training set:
→ The algorithm is not properly trained
→ New data is not classified well
- Minimum requirements:
→ For supervised projects: reliable labeling of outputs
→ Size: > 10 x number of inputs, and more for complex (non-linear)
relationships
→ No self-selection effects (e.g. communication by regular mail/email)
→ Good representation of all phenomena that can occur
Three one-liners about training sets
- Garbage in, garbage out: if the data set is bad, the model’s predictions
will be bad too, so the data you put in has to be good
- Better data often beats better algorithms
- Data silos are the enemy
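
Two of the minimum requirements for a training set (size and labeling) can be checked mechanically. The Python sketch below is an assumption-laden illustration: the function name `check_training_set`, the column names and the example rows are all hypothetical, and the labeling check only tests that a label is present, not that it is reliable.

```python
def check_training_set(rows, input_names, label_name, factor=10):
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    # Rule of thumb from the notes: size > factor x number of inputs.
    if len(rows) <= factor * len(input_names):
        problems.append(
            f"too small: {len(rows)} rows for {len(input_names)} inputs"
        )
    # Supervised projects need a label on every row.
    unlabeled = sum(1 for row in rows if row.get(label_name) is None)
    if unlabeled:
        problems.append(f"{unlabeled} rows without a label")
    return problems


# Hypothetical training set: 20 emails, 2 inputs, one label missing.
rows = [
    {"recipients": i + 1, "length": 100 + i, "type": "ham"}
    for i in range(20)
]
rows[3]["type"] = None

problems = check_training_set(rows, ["recipients", "length"], "type")
print(problems)
```

Self-selection effects and representativeness cannot be checked this way; they depend on how the data was collected.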

Step 2: Train the algorithm/model
- Goal of the algorithm: obtaining relevant insights from the training set
→ Find systematic patterns in/relations between variables
- Examples
→ How can we use features of customers to predict fraud?
→ Can we divide customers of InterAmerican into segments?
- Over the years, many algorithms/models have been developed

Illustration: learning to filter spam
- Spam is all email that the user does not want to receive and has not asked to
receive
- Objective: identify spam emails
- Data (step 1): a database of emails
→ Email type (‘spam’ or ‘ham’, as classified by users)
→ Number of recipients
→ Email length
→ Country (based on IP)
→ Customer type
→ Wording
→ Images (+host)
The algorithm tries to distinguish between ham and spam. E.g.:
- Classification based on the number of recipients (spam has more recipients
than ham).
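
Classification on a single input such as the number of recipients amounts to choosing a cut-off. A minimal Python sketch, assuming made-up emails and a simple search over the observed values (the function names are hypothetical, not from the lectures):

```python
# Hypothetical training set: (number of recipients, label given by users).
emails = [
    (1, "ham"), (2, "ham"), (3, "ham"), (5, "ham"),
    (20, "spam"), (35, "spam"), (50, "spam"), (120, "spam"),
]


def errors_for_threshold(data, threshold):
    """Count mistakes when 'more recipients than threshold' means spam."""
    return sum(
        1 for recipients, label in data
        if ("spam" if recipients > threshold else "ham") != label
    )


def best_threshold(data):
    """Try each observed value as cut-off and keep the one with fewest errors."""
    candidates = sorted({recipients for recipients, _ in data})
    return min(candidates, key=lambda t: errors_for_threshold(data, t))


threshold = best_threshold(emails)
print(threshold, errors_for_threshold(emails, threshold))
```

On this toy data the two groups do not overlap, so a cut-off with zero training errors exists; with real emails some overlap is to be expected.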

Step 2: Learning from the training set – multiple inputs
- Classification of ‘spam’ and ‘ham’ can also be based on multiple inputs!
- E.g. the number of recipients and email length (number of inputs = 2).
- Classification can again be based on a line: classify all emails north-west of
it as ‘ham’ and all emails south-east of it as ‘spam’. Note that this is
not a regression line!
→ In this case, any of these lines classifies the data perfectly
→ Training the model means: finding the best line
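
One way to let a program find such a line is the perceptron rule, sketched below in Python. This is an assumed illustration, not necessarily the algorithm used in the course: the data points are made up, and the perceptron stops at the first line that separates the training data, which is not necessarily the ‘best’ line.

```python
# Hypothetical training set: (number of recipients, length in hundreds of words).
ham = [(1, 6), (2, 8), (3, 7), (2, 5)]    # north-west: few recipients, long
spam = [(6, 2), (8, 3), (7, 1), (9, 4)]   # south-east: many recipients, short
data = [(p, +1) for p in ham] + [(p, -1) for p in spam]


def score(w, b, point):
    """Positive on the ham side of the line, negative on the spam side."""
    return w[0] * point[0] + w[1] * point[1] + b


def train_perceptron(data, max_epochs=100):
    """Shift the line after every misclassified email until none is left."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for point, label in data:
            if label * score(w, b, point) <= 0:   # wrong side or on the line
                w[0] += label * point[0]
                w[1] += label * point[1]
                b += label
                mistakes += 1
        if mistakes == 0:                         # every email classified correctly
            break
    return w, b


w, b = train_perceptron(data)
training_errors = sum(1 for p, label in data if label * score(w, b, p) <= 0)
print(w, b, training_errors)
```

The line itself is the set of points where `score` equals zero; it separates classes and does not predict a value, which is why it is not a regression line.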

Step 3: Classifying new data
- Once the model is trained in step 2 (i.e. when the best line is found), we
can use it to classify new data, for which the output is unknown
- We first place the new email in the input space
- Subsequently, we classify it, based on the trained model, according to the
subspace in which it resides
- In this case, we should classify the new email as ‘spam’
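
Classifying a new email then only requires checking on which side of the trained line it falls. In the Python sketch below the line is not learned but simply assumed (hypothetical weights and intercept), and the new email is made up:

```python
# Hypothetical trained line: -1 * recipients + 1 * length + 0 = 0.
# Emails above the line (north-west) are ham, emails below it (south-east) are spam.
WEIGHTS = (-1.0, 1.0)
INTERCEPT = 0.0


def classify(recipients, length):
    """Place the email in the input space and report the subspace it falls in."""
    side = WEIGHTS[0] * recipients + WEIGHTS[1] * length + INTERCEPT
    return "ham" if side > 0 else "spam"


# New email with unknown type: 8 recipients, length 2 (in hundreds of words).
new_email = (8, 2)
label = classify(*new_email)
print(label)
```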

Assessing the machine learning process
- So far: ideal situation, in which the 2 types of emails are perfectly separable by
a simple line
- What if, as on the next slide, a straight line leads to
2 errors?
- To reduce the number of errors to zero, we would need a more complicated
model
- Which model is likely a better representation of the underlying
phenomenon?
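
One way to explore this question is to compare both kinds of model on emails that were not used for training. The Python sketch below uses made-up data with one input; a single cut-off stands in for the straight line and a 1-nearest-neighbour rule stands in for the more complicated model. Both stand-ins, and all names, are assumptions for illustration only.

```python
# Hypothetical data: (number of recipients, label). Two training emails are
# atypical: a spam with 4 recipients and a ham with 14.
train = [
    (1, "ham"), (2, "ham"), (3, "ham"), (5, "ham"), (6, "ham"), (14, "ham"),
    (4, "spam"), (9, "spam"), (11, "spam"), (16, "spam"), (20, "spam"),
]
holdout = [
    (1, "ham"), (4, "ham"), (5, "ham"),
    (10, "spam"), (13, "spam"), (18, "spam"),
]


def make_cutoff_model(data):
    """Simple model: one cut-off; more recipients than the cut-off means spam."""
    def errors(t):
        return sum(("spam" if x > t else "ham") != y for x, y in data)
    best = min(sorted({x for x, _ in data}), key=errors)
    return lambda x: "spam" if x > best else "ham"


def make_nearest_neighbour_model(data):
    """Complex model: copy the label of the closest training email."""
    return lambda x: min(data, key=lambda row: abs(row[0] - x))[1]


def count_errors(model, data):
    return sum(model(x) != y for x, y in data)


simple = make_cutoff_model(train)
complex_model = make_nearest_neighbour_model(train)

simple_train_errors = count_errors(simple, train)
complex_train_errors = count_errors(complex_model, train)
simple_holdout_errors = count_errors(simple, holdout)
complex_holdout_errors = count_errors(complex_model, holdout)
print(simple_train_errors, complex_train_errors)
print(simple_holdout_errors, complex_holdout_errors)
```

On this toy data the complicated model makes no training errors because it memorizes the two atypical emails, and those same two emails then mislead it on the held-out emails. Whether that happens on real data has to be checked with a separate test set.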
