
Summary - Advanced Data Analysis

Pages
113
Uploaded on
19-10-2022
Written in
2020/2021

- Introduction - Processing principles - Data mining - Principal component analysis - Supervised learning - Regression - Machine learning methods


Content preview

Advanced data analysis
Contents
Chapter 1 - Introduction .................................................................................................................... 6
A bit of context .............................................................................................................................. 6
Introduction .............................................................................................................................. 6
Characteristics of big data .......................................................................................................... 6
But what is data ............................................................................................................................. 7
Objects and attributes ............................................................................................................... 7
Attribute types .......................................................................................................................... 8
Properties of attributes.............................................................................................................. 8
Discrete vs continuous attributes ............................................................................................... 8
Dataset types ................................................................................................................................ 9
Data mining ................................................................................................................................. 11
General.................................................................................................................................... 11
Is it data mining? ..................................................................................................................... 11
Data mining and statistics ........................................................................................................ 11
Data mining challenges ............................................................................................................ 12
Tasks ........................................................................................................................................... 13
General.................................................................................................................................... 13
Supervised ............................................................................................................................... 13
Unsupervised........................................................................................................................... 14
Data mining applications ............................................................................................................. 15
Overview ..................................................................................................................................... 15
Where are we with data mining now ........................................................................................... 15
Chapter 2 - Processing principles .................................................................................................... 16
Introduction ................................................................................................................................ 16
Unstructured data ................................................................................................................... 16
Common data processing steps ................................................................................................... 17
Overview ................................................................................................................................. 17
Feature extraction ................................................................................................................... 17
Attribute transformation ......................................................................................................... 17
Discretization........................................................................................................................... 18
Aggregation ............................................................................................................................. 18
Noise removal.......................................................................................................................... 18
Identifying outliers................................................................................................................... 19
Sampling .................................................................................................................................. 19
Handling duplicate data ........................................................................................................... 20
Handling missing values ........................................................................................................... 20
Dimensionality reduction ......................................................................................................... 21
Processing steps for specific data types ....................................................................................... 22
Image data............................................................................................................................... 22
Survey data.............................................................................................................................. 23
Sequence data ......................................................................................................................... 23
Text data ................................................................................................................................. 24
Omics data............................................................................................................................... 25
Chapter 3 - Data mining – Unsupervised clustering .......................................................................... 31
Unsupervised vs supervised ..................................................................................................... 31
Introduction ................................................................................................................................ 31
Clustering ................................................................................................................................ 31
Similarity ................................................................................................................................. 32
Dendrograms ........................................................................................................................... 34
Hierarchical clustering vs partitional clustering ........................................................................ 36
Hierarchical clustering ................................................................................................................. 36
General.................................................................................................................................... 36
Bottom-up ............................................................................................................................... 37
How do you calculate distance between already existing clusters ............................................ 37
Single linkage = nearest neighbour........................................................................................... 38
Complete linkage = furthest neighbour ................................................................................... 39
Group average ......................................................................................................................... 39
Ward’s method ........................................................................................................................ 40
Comparison ............................................................................................................................. 40
Partitional clustering ................................................................................................................... 41
General.................................................................................................................................... 41
How many clusters?................................................................................................................. 41
How to tell right number of clusters? ....................................................................................... 41
Objective function: squared error ............................................................................................ 42
k-means steps.......................................................................................................................... 42
Importance of choosing initial centroids .................................................................................. 44
k-means limitations ................................................................................................................. 44
k-means: conclusion ................................................................................................................ 45
Chapter 4 - Principal component analysis ........................................................................................ 46
Introduction ................................................................................................................................ 46
Principal component analysis ................................................................................................... 46
Multivariate data ..................................................................................................................... 46
Basic variable statistics ............................................................................................................ 46
Data transformation ................................................................................................................ 47
Comparison between variables ................................................................................................ 48
Still too many variables ............................................................................................................ 50
Data projection ........................................................................................................................ 50
PCA - Theory ................................................................................................................................ 51
Introduction ............................................................................................................................ 51
How PCA works........................................................................................................................ 51
PCA output .............................................................................................................................. 52
PCA summary .......................................................................................................................... 53
PCA usage ................................................................................................................................ 53
How many PCs are enough to cover a data set? ....................................................................... 53
PCA - examples ............................................................................................................................ 54
Possum dataset ....................................................................................................................... 54
Nutrition dataset ..................................................................................................................... 56
B-cell receptor sequencing ....................................................................................................... 59
Metagenomics data ................................................................................................................. 60
t-SNE ........................................................................................................................................... 62
What is t-SNE? ......................................................................................................................... 62
How does t-SNE work?............................................................................................................. 62
PCA vs t-SNE ............................................................................................................................ 63
Perplexity ................................................................................................................................ 63
t-SNE for single cell RNAseq ..................................................................................................... 63
Chapter 5 - Supervised learning ....................................................................................................... 64
Classification problem ................................................................................................................. 64
Cat or dog problem .................................................................................................................. 64
Pigeon problem ....................................................................................................................... 64
Grasshopper problem .............................................................................................................. 64
Regression vs classification ...................................................................................................... 66
Linear classifier ............................................................................................................................ 66
Grasshopper example .............................................................................................................. 67
Decision boundary ................................................................................................................... 67
Examples ................................................................................................................................. 68
Iris dataset ............................................................................................................................... 69
Support vector machine........................................................................................................... 69
Decision value.......................................................................................................................... 70
Classifier overview ................................................................................................................... 71
Estimating the performance of the classifier ................................................................................ 71
Predictive accuracy .................................................................................................................. 71
Class labels .............................................................................................................................. 72
Confusion matrix ..................................................................................................................... 72
Type I error vs type II error ...................................................................................................... 73
Values that can be acquired from confusion matrix ................................................................. 73
Thresholds and accuracy .......................................................................................................... 73
ROC-curve ............................................................................................................................... 75
PR curve – precision recall curve .............................................................................................. 76
ROC vs PR curves ..................................................................................................................... 76
Nearest Neighbour Classifier........................................................................................................ 77
Chapter 6 - Regression..................................................................................................................... 79
Introduction ................................................................................................................................ 79
Introductory example .............................................................................................................. 79
Classification vs regression....................................................................................................... 79
Simple linear regression............................................................................................................... 80
General.................................................................................................................................... 80
Multiple linear regression ............................................................................................................ 80
General.................................................................................................................................... 80
Best fit ..................................................................................................................................... 81
Objective function ................................................................................................................... 81
Evaluation................................................................................................................................ 82
Non-linear regression .................................................................................................................. 83
Logistic regression ................................................................................................................... 83
Overfitting ................................................................................................................................... 83
How do we estimate the capacity of our model to overfit? ...................................................... 84
K-fold cross validation – how do we estimate the accuracy of our model? ............................... 84
Factors to consider when building a model .................................................................................. 85
Speed and scalability ............................................................................................................... 85
Interpretability ........................................................................................................................ 86

Seller: lizaburdz, Universiteit Antwerpen