This document consists of lecture notes from the theory lessons, supplemented with explanatory figures and additional information. It therefore contains all the theory that should be studied for the exam, except the practicals.
Chapter 1: Introduction.................................................................................................................................... 4
1.1: Introduction .......................................................................................................................................... 4
o Before we start............................................................................................................................... 4
§ A few practical things ................................................................................................................. 4
® Background ........................................................................................................................... 4
o A bit of context............................................................................................................................... 4
§ Big data ..................................................................................................................................... 4
® Definition of big data ............................................................................................................. 5
® Big data is characterized by: .................................................................................................. 5
® Large scale data and AI brought a new data intensive research paradigm .............................. 8
§ What is data? Some definitions of what we are dealing with and how we can represent it ........ 8
® Data can be given by objects and attributes ........................................................................... 8
a) Data object....................................................................................................................... 9
b) Attribute .......................................................................................................................... 9
® Dataset types ...................................................................................................................... 10
a) Record:........................................................................................................................... 10
b) Graph: ............................................................................................................................ 11
c) Ordered:......................................................................................................................... 11
§ Data mining ............................................................................................................................. 12
® What is data mining? ........................................................................................................... 12
® Examples: Is it data mining?................................................................................................. 13
® Data mining challenges........................................................................................................ 13
® Major tasks of data mining (after preprocessing) ................................................................. 14
1) Supervised data mining ................................................................................................... 14
2) Unsupervised data mining ............................................................................................... 17
® Data mining is business ....................................................................................................... 18
® Value of data ....................................................................................................................... 19
® Evolution............................................................................................................................. 19
Chapter 2: Processing principles..................................................................................................................... 20
2.1: Processing principles............................................................................................................................ 20
o Introduction ................................................................................................................................. 20
§ What you usually have vs. what you want and need ................................................................. 20
® In reality you usually have ‘dirty data’ .................................................................................. 20
® Data that you actually want/need is: ................................................................................... 20
o Pre-processing and transformation → to get more minable data that can be further used ............ 20
§ Role of pre-processing and transformation............................................................................... 20
® Unstructured data ............................................................................................................... 20
® Common data processing steps that each make data more ready for data mining ................ 21
a) Feature extraction:......................................................................................................... 21
b) Attribute transformation = feature transformation ........................................................ 21
c) Discretization ................................................................................................................. 22
d) Aggregation.................................................................................................................... 22
e) Noise removal ................................................................................................................ 22
f) Identifying outliers → outlier removal ........................................................................... 23
g) Sampling ........................................................................................................................ 23
h) Handling duplicated data ............................................................................................... 24
i) Handling missing values ................................................................................................. 24
j) Dimensionality reduction ............................................................................................... 25
® Processing steps for specific data types: what types of features are we dealing with? .......... 29
a) Image data: .................................................................................................................... 29
b) Survey data .................................................................................................................... 30
c) Sequence data................................................................................................................ 31
d) Text data ........................................................................................................................ 32
e) Omics data ..................................................................................................................... 32
f) Temporal........................................................................................................................ 38
Chapter 3: Unsupervised clustering................................................................................................................ 39
3.1: Unsupervised clustering ....................................................................................................................... 39
o Introduction ................................................................................................................................. 39
§ Unsupervised vs. supervised .................................................................................................... 39
® Quick overview of the differences between supervised and unsupervised ................................... 39
§ Clustering ................................................................................................................................ 39
® What is clustering? .............................................................................................................. 39
® Exists in different domains and has different names but it does something quite similar ...... 39
® Natural grouping ................................................................................................................. 39
§ Similarity ................................................................................................................................. 40
® What is similarity? ................................................................................................................. 40
® Defining distance measures ................................................................................................. 40
® How do we measure similarity? ........................................................................................... 41
§ Dendrograms ........................................................................................................................... 42
® What is it? ........................................................................................................................... 42
® Example .............................................................................................................................. 42
® Use of dendrograms ............................................................................................................ 44
§ Algorithms ............................................................................................................................... 44
o 2 types of clustering ..................................................................................................................... 45
§ Hierarchical clustering ............................................................................................................. 45
® Principle: ............................................................................................................................. 45
® Heuristic search (= a more practical, feasible way to come up with the best dendrogram, without forgetting that there are multiple options out there) ....................................................... 45
→ Since we cannot test all possible trees, we have to do a heuristic search over the possible trees. We could do this bottom-up or top-down. .......................................................................................... 45
→ Use a heuristic search → we cannot guarantee that we get the optimal solution, but it is way faster than testing every option ..................................................................................................................... 45
® How to measure the distance between 2 clusters based on the distance function? .............. 46
§ Partitional clustering ............................................................................................................... 50
® What is it? ........................................................................................................................... 50
® How many clusters? → how to specify k? ............................................................................ 50
® K-means steps (simple & efficient algorithm) ....................................................................... 51
® Importance of choosing initial centroids .............................................................................. 53
® Weakness of k-means.......................................................................................................... 53
Chapter 4: Principal component analysis (PCA) .............................................................................................. 54
4.1: Principal component analysis (PCA) ..................................................................................................... 54
o PCA as the backbone of modern data analysis .............................................................................. 54
§ What is principal component analysis and why is it necessary?................................................. 54
® PCA is the first thing you do when you get a new dataset..................................................... 54
® Reasons to do PCA:.............................................................................................................. 54
® Multivariate data................................................................................................................. 54
§ Important concepts.................................................................................................................. 55
® Basic variable statistics ........................................................................................................ 55
a) Mean .............................................................................................................................. 55
b) Median ........................................................................................................................... 56
c) Range ............................................................................................................................. 56
d) Variance ......................................................................................................................... 56
e) Standard deviation.......................................................................................................... 56
® Data transformation ............................................................................................................ 56
2) Comparing variables ................................................................................................................. 57
o How does PCA work? .................................................................................................................... 58
§ Data projection ........................................................................................................................ 58
® Too many variables ............................................................................................................. 58
® What’s data projection? ...................................................................................................... 59
® Why use projections? .......................................................................................................... 59
® Data visualization and simplification → data projection should capture as much of the information as possible ................................................................................................................ 60
® Geometric interpretation of PCA ......................................................................................... 60
® PCA output: IMPORTANT for the exam to be able to interpret the output! ................................................ 62
® PCA usage: scores and loadings ........................................................................................... 64
® PCA examples...................................................................................................................... 64
§ t-SNE
® = alternative method for data projection ............................................................................. 71
® How? .................................................................................................................................. 72
® Comparison PCA and t-SNE .................................................................................................. 74
® Perplexity ............................................................................................................................ 74
® Example: t-SNE for single cell RNAseq .................................................................................. 74
Chapter 5: Supervised learning ...................................................................................................................... 76
5.1: Supervised learning ............................................................................................................................. 76
o Introduction ................................................................................................................................. 76
§ Classification problem = problem we have a lot of experience with .......................................... 76
® Use features of an object to assign a hopefully correct label to an object ............................. 76
® Pigeon problems: training pigeons to classify paintings ........................................................ 76
® Grasshopper problem: Given a collection of annotated data. In this case 5 Katydids and 5
Grasshoppers, decide what type of insect the unlabeled example is (2 similar, but not identical
animals) ....................................................................................................................................... 76
o Regression vs. classification .......................................................................................................... 78
§ General.................................................................................................................................... 78
® Differences.......................................................................................................................... 78
§ Classification............................................................................................................................ 78
a) Simple linear classifier.................................................................................................... 78
® General: what is a simple linear classifier? ........................................................................... 78
® Support vector machines (SVM)........................................................................................... 82
® Decision value ..................................................................................................................... 83
® Predictive accuracy.............................................................................................................. 84
® Confusion matrix = matrix that tabulates all of the samples by classified label vs. true label ........ 85
® Thresholds and accuracy ..................................................................................................... 86
® ROC and PR curves .............................................................................................................. 87
b) Nearest neighbor classifier ............................................................................................. 90
® What is this type of classifier? ............................................................................................. 90
Chapter 6: Regression .................................................................................................................................... 93
6.1: Regression ........................................................................................................................................... 93
o Regression = a supervised machine learning (ML) model that can be used to analyze multivariate data (in data science you often need to deal with regression problems, BUT this is different from ‘normal’ statistics) ............................................................................................................................................... 93
§ The regression problem ........................................................................................................... 93
® Given a collection of annotated data (in this case a number of insects with their ages), you need to try to predict a variable about the data ............................................................................ 93
§ Regression vs. classification...................................................................................................... 94
® Classification....................................................................................................................... 94
® Regression .......................................................................................................................... 94
§ Types of regression .................................................................................................................. 94
® Simple linear regression...................................................................................................... 94
® Multiple linear regression ................................................................................................... 95
® Non-linear regression ......................................................................................................... 98
® Logistic regression .............................................................................................................. 98
® Cox regression .................................................................................................................... 99
® Regularized regression ...................................................................................................... 100
§ Considerations that need to be made with regression ............................................................ 103
® Overfitting......................................................................................................................... 103
- Intuitively we would say 9 ................................................................................................. 103
a) K-fold cross validation .................................................................................................. 104
b) Leave-one-out cross validation (CV) = special case of K-fold cross validation where K = number of samples ................................................................................................................ 105
® Speed and scalability ......................................................................................................... 105
® Interpretability → model interpretability is really important and leads to model transparency ........ 105
® Robustness........................................................................................................................ 106
Chapter 7: Machine learning methods ......................................................................................................... 108
7.1: Machine learning methods ................................................................................................................ 108
o Supervised machine learning methods........................................................................................ 108
§ Recap .................................................................................................................................... 108
® Supervised vs. unsupervised .............................................................................................. 109
§ Classification.......................................................................................................................... 109
® Classification ..................................................................................................................... 109
® Classification algorithms .................................................................................................... 109
a) Support vector machines.............................................................................................. 110
b) Decision trees............................................................................................................... 110
c) Random forest ............................................................................................................. 114
d) Neural networks (NN) and deep learning ...................................................................... 119
e) K-nearest neighbors
Chapter 1: Introduction
1.1: Introduction
• Introduction
o Before we start
§ A few practical things
® Background
¨ Background on bioinformatics, statistics, omics data analysis (NGS,
microarrays, …), data mining and machine learning
o A bit of context
§ Big data
® What is big data?
¨ Over the last 5 decades, our view of the human system has evolved:
from looking at the human body from multi-disciplinary perspectives to
seeing the human system as a complex interplay between genes, proteins,
small molecules, … that interact with each other in a very complex way and
Stuvia is a marketplace, so this document is sold not by Stuvia but by the seller jentebeeldens1. Stuvia facilitates the payment to the seller.