Lecture notes

Notes Advanced Data Analysis

This document consists of lecture notes from the theory lessons, supplemented with explanatory figures and additional information. It therefore contains all the theory to be studied for the exam, excluding the practicals.

Preview: 4 of 131 pages

  • 19 February 2024
  • 131
  • 2022/2023
  • Lecture notes
  • Kris Laukens
  • All classes
Seller: jentebeeldens1
Table of contents

Chapter 1: Introduction.................................................................................................................................... 4
1.1: Introduction .......................................................................................................................................... 4
o Before we start............................................................................................................................... 4
§ A few practical things ................................................................................................................. 4
® Background ........................................................................................................................... 4
o A bit of context............................................................................................................................... 4
§ Big data ..................................................................................................................................... 4
® Definition of big data ............................................................................................................. 5
® Big data is characterized by: .................................................................................................. 5
® Large scale data and AI brought a new data intensive research paradigm .............................. 8
§ What is data? Some definitions of what we are dealing with and how we can represent it?........ 8
® Data can be given by objects and attributes ........................................................................... 8
a) Data object....................................................................................................................... 9
b) Attribute .......................................................................................................................... 9
® Dataset types ...................................................................................................................... 10
a) Record:........................................................................................................................... 10
b) Graph: ............................................................................................................................ 11
c) Ordered:......................................................................................................................... 11
§ Data mining ............................................................................................................................. 12
® What is data mining? ........................................................................................................... 12
® Examples: Is it data mining?................................................................................................. 13
® Data mining challenges........................................................................................................ 13
® Major tasks of data mining (after preprocessing) ................................................................. 14
1) Supervised data mining ................................................................................................... 14
2) Unsupervised data mining ............................................................................................... 17
® Data mining is business ....................................................................................................... 18
® Value of data ....................................................................................................................... 19
® Evolution............................................................................................................................. 19

Chapter 2: Processing principles..................................................................................................................... 20
2.1: Processing principles............................................................................................................................ 20
o Introduction ................................................................................................................................. 20
§ What you usually have vs. what you want and need ................................................................. 20
® In reality you usually have ‘dirty data’ .................................................................................. 20
® Data that you actually want/need is: ................................................................................... 20
o Pre-processing and transformation → to get more minable data that can be further used ............ 20
§ Role of pre-processing and transformation............................................................................... 20
® Unstructured data ............................................................................................................... 20
® Common data processing steps that each make data more ready for data mining ................ 21
a) Feature extraction:......................................................................................................... 21
b) Attribute transformation = feature transformation ........................................................ 21
c) Discretization ................................................................................................................. 22
d) Aggregation.................................................................................................................... 22
e) Noise removal ................................................................................................................ 22
f) Identifying outliers → outlier removal ........................................................................... 23
g) Sampling ........................................................................................................................ 23
h) Handling duplicated data ............................................................................................... 24
i) Handling missing values ................................................................................................. 24
j) Dimensionality reduction ............................................................................................... 25
® Processing steps for specific data types: what types of features are we dealing with? .......... 29
a) Image data: .................................................................................................................... 29
b) Survey data .................................................................................................................... 30
c) Sequence data................................................................................................................ 31
d) Text data ........................................................................................................................ 32
e) Omics data ..................................................................................................................... 32
f) Temporal........................................................................................................................ 38

Chapter 3: Unsupervised clustering................................................................................................................ 39
3.1: Unsupervised clustering ....................................................................................................................... 39
o Introduction ................................................................................................................................. 39
§ Unsupervised vs. supervised .................................................................................................... 39
® Quick overview in difference between supervised and unsupervised ................................... 39
§ Clustering ................................................................................................................................ 39
® What is clustering? .............................................................................................................. 39
® Exists in different domains and has different names but it does something quite similar ...... 39
® Natural grouping ................................................................................................................. 39
§ Similarity ................................................................................................................................. 40
® What is similarity? ................................................................................................................ 40
® Defining distance measures ................................................................................................. 40
® How do we measure similarity? ........................................................................................... 41
§ Dendrograms ........................................................................................................................... 42
® What is it? ........................................................................................................................... 42
® Example .............................................................................................................................. 42
® Use of dendrograms ............................................................................................................ 44
§ Algorithms ............................................................................................................................... 44
o 2 types of clustering ..................................................................................................................... 45
§ Hierarchical clustering ............................................................................................................. 45
® Principle: ............................................................................................................................. 45
® Heuristic search (= a more practical, feasible way to come up with the best dendrogram, without forgetting that there are multiple options out there) ....................................................... 45
→ Since we cannot test all possible trees, we have to do a heuristic search over them. We could do this bottom-up or top-down. .......................................................................................... 45
→ With a heuristic search we cannot guarantee that we get the optimal solution, but it is much faster than testing every option ..................................................................................................................... 45
® How to measure the distance between 2 clusters based on the distance function? .............. 46
§ Partitional clustering ............................................................................................................... 50
® What is it? ........................................................................................................................... 50
® How many clusters? → how to specify k? ............................................................................ 50
® K-means steps (simple & efficient algorithm) ....................................................................... 51
® Importance of choosing initial centroids .............................................................................. 53
® Weakness of k-means.......................................................................................................... 53

Chapter 4: Principal component analysis (PCA) .............................................................................................. 54
4.1: Principal component analysis (PCA) ..................................................................................................... 54
o PCA as the backbone of modern data analysis .............................................................................. 54
§ What is principal component analysis and why is it necessary?................................................. 54
® PCA is the first thing you do when you get a new dataset..................................................... 54
® Reasons to do PCA:.............................................................................................................. 54
® Multivariate data................................................................................................................. 54
§ Important concepts.................................................................................................................. 55
® Basic variable statistics ........................................................................................................ 55
a) Mean .............................................................................................................................. 55
b) Median ........................................................................................................................... 56
c) Range ............................................................................................................................. 56
d) Variance ......................................................................................................................... 56
e) Standard deviation.......................................................................................................... 56
® Data transformation ............................................................................................................ 56
2) Comparing variables ................................................................................................................. 57
o How does PCA work? .................................................................................................................... 58
§ Data projection ........................................................................................................................ 58
® Too many variables ............................................................................................................. 58
® What’s data projection? ...................................................................................................... 59
® Why use projections? .......................................................................................................... 59
® Data visualization and simplification → data projection should capture as much of the information as possible ................................................................................................................ 60
® Geometric interpretation of PCA ......................................................................................... 60
® PCA output: IMPORTANT for the exam to interpret output ! ................................................ 62
® PCA usage: scores and loadings ........................................................................................... 64
® PCA examples...................................................................................................................... 64
§ t-SNE
® = alternative method for data projection ............................................................................. 71
® How? .................................................................................................................................. 72
® Comparison PCA and t-SNE .................................................................................................. 74
® Perplexity ............................................................................................................................ 74
® Example: t-SNE for single cell RNAseq .................................................................................. 74

Chapter 5: Supervised learning ...................................................................................................................... 76
5.1: Supervised learning ............................................................................................................................. 76
o Introduction ................................................................................................................................. 76
§ Classification problem = problem we have a lot of experience with .......................................... 76
® Use features of an object to assign a hopefully correct label to an object ............................. 76
® Pigeon problems: training pigeons to classify paintings ........................................................ 76
® Grasshopper problem: Given a collection of annotated data. In this case 5 Katydids and 5
Grasshoppers, decide what type of insect the unlabeled example is (2 similar, but not identical
animals) ....................................................................................................................................... 76
o Regression vs. classification .......................................................................................................... 78
§ General.................................................................................................................................... 78
® Differences.......................................................................................................................... 78
§ Classification............................................................................................................................ 78
a) Simple linear classifier.................................................................................................... 78
® General: what is a simple linear classifier? ........................................................................... 78
® Support vector machines (SVM)........................................................................................... 82
® Decision value ..................................................................................................................... 83
® Predictive accuracy.............................................................................................................. 84
® Confusion matrix = matrix that tabulates all of the samples by classified label vs. true label ..... 85
® Thresholds and accuracy ..................................................................................................... 86
® ROC and PR curves .............................................................................................................. 87
b) Nearest neighbor classifier ............................................................................................. 90
® What is this type of classifier? ............................................................................................. 90

Chapter 6: Regression .................................................................................................................................... 93
6.1: Regression ........................................................................................................................................... 93
o Regression = a supervised machine learning (ML) model that can be used to analyze multivariate data (in data science you often need to deal with regression problems, BUT this is different from ‘normal’ statistics) ............................................................................................................................................... 93
§ The regression problem ........................................................................................................... 93
® Given a collection of annotated data (in this case a number of insects with their ages), you need to try to predict a variable about the data ............................................................................ 93
§ Regression vs. classification...................................................................................................... 94
® Classification....................................................................................................................... 94
® Regression .......................................................................................................................... 94
§ Types of regression .................................................................................................................. 94
® Simple linear regression...................................................................................................... 94
® Multiple linear regression ................................................................................................... 95
® Non-linear regression ......................................................................................................... 98
® Logistic regression .............................................................................................................. 98
® Cox regression .................................................................................................................... 99
® Regularized regression ...................................................................................................... 100
§ Considerations that need to be made with regression ............................................................ 103
® Overfitting......................................................................................................................... 103
- Intuitively we would say 9 ................................................................................................. 103
a) K-fold cross validation .................................................................................................. 104
b) Leave-one-out cross validation (CV) = special case of K-fold cross validation where K = number of samples ................................................................................................................ 105
® Speed and scalability ......................................................................................................... 105
® Interpretability → model interpretability is really important and leads to model transparency ..... 105
® Robustness........................................................................................................................ 106

Chapter 7: Machine learning methods ......................................................................................................... 108
7.1: Machine learning methods ................................................................................................................ 108
o Supervised machine learning methods........................................................................................ 108
§ Recap .................................................................................................................................... 108
® Supervised vs. unsupervised .............................................................................................. 109
§ Classification.......................................................................................................................... 109
® Classification ..................................................................................................................... 109
® Classification algorithms .................................................................................................... 109
a) Support vector machines.............................................................................................. 110
b) Decision trees............................................................................................................... 110
c) Random forest ............................................................................................................. 114
d) Neural networks (NN) and deep learning ...................................................................... 119
e) K-nearest neighbors



Chapter 1: Introduction

1.1: Introduction
• Introduction
o Before we start
§ A few practical things
® Background
¨ Background on bioinformatics, statistics, omics data analysis (NGS,
microarrays, …), data mining and machine learning
o A bit of context
§ Big data
® What is big data?
¨ Over the last 5 decades, the view of the human system has evolved:
from seeing the human body from multi-disciplinary perspectives to seeing the
human system as a complex interplay between genes, proteins, small
molecules, … that interact with each other in a very complex way and


