Summary
Summary Advanced Data Analysis class - For open book exam (content table with links to pages)
Summary of the course Advanced Data Analysis, made for the open-book exam. It contains a table of contents with clickable links that take you to the exact page, and it covers all theory classes plus notes made during class.
Preview of 4 of the 34 pages
Uploaded on
4 December 2022
Number of pages
34
Written in
2021/2022
Type
Summary
Summary Advanced Data Analysis
Table of contents
1. Introduction ............................................................................................................ 5
Big data ............................................................................................................................. 5
Data volume .................................................................................................................... 5
Data velocity .................................................................................................................... 5
Data variety ..................................................................................................................... 5
Data veracity ................................................................................................................... 5
Data .................................................................................................................................. 5
Attribute values ................................................................................................................ 5
Attribute types ................................................................................................................. 5
Properties of attributes ...................................................................................................... 5
Discrete vs. Continuous ..................................................................................................... 5
Dataset types ..................................................................................................................... 6
Record data ..................................................................................................................... 6
Graph ............................................................................................................................. 6
Ordered data ................................................................................................................... 6
Data mining ........................................................................................................................ 6
Definitions ....................................................................................................................... 7
Statistics ...................................................................................................................... 7
Data mining & Statistics ................................................................................................. 7
Challenges in Data mining ................................................................................................. 7
Tasks ................................................................................................................................. 7
Supervised classification .................................................................................................... 7
Applications .................................................................................................................. 8
Unsupervised classification ................................................................................................ 8
Overview ............................................................................................................................ 8
2. Processing principles ........................................................................... 9
Common steps .................................................................................................................... 9
Feature extraction ............................................................................................................ 9
Attribute transformation .................................................................................................... 9
Discretization ................................................................................................................... 9
Aggregation ..................................................................................................................... 9
Noise removal .................................................................................................................. 9
Outlier removal ................................................................................................................ 9
Sampling ......................................................................................................................... 9
Simple Random Sampling ............................................................................................... 9
Stratified Sampling ....................................................................................................... 10
Handling duplicate data .................................................................................................... 10
Handling missing values ................................................................................................... 10
Dimensionality reduction .................................................................................................. 10
PCA ............................................................................................................................. 10
Feature subset selection ................................................................................................ 10
Feature creation ........................................................................................................... 11
Processing steps for specific data types ................................................................................. 11
Image data ..................................................................................................................... 11
Survey data .................................................................................................................... 11
Sequence data ................................................................................................................ 11
Text ............................................................................................................................... 12
Category/Ontologies ..................................................................................................... 12
Bag of words ................................................................................................................ 12
Omics ............................................................................................................................ 12
Genomics .................................................................................................................... 12
Transcriptomics ............................................................................................................ 12
Meta-genomics ............................................................................................................. 13
Proteomics ................................................................................................................... 13
Metabolomics ............................................................................................................... 14
Conclusion ......................................................................................................................... 14
3. Unsupervised clustering .................................................................... 15
Definitions ......................................................................................................................... 15
Introduction ....................................................................................................................... 15
Clustering ....................................................................................................................... 15
Similarities ..................................................................................................................... 15
Distance measures ........................................................................................................ 15
Measure similarity ......................................................................................................... 15
Dendrogram ................................................................................................................... 16
Hierarchical clustering ......................................................................................................... 16
Determination of distance ................................................................................................. 16
Partitional clustering ........................................................................................................... 17
4. Principal component analysis ............................................................ 18
Data & basic variable statistics ............................................................................................. 18
Multivariate data ............................................................................................................. 18
Basic variable statistics .................................................................................................... 18
Data transformation ......................................................................................................... 18
Normalization .................................................................................................................. 18
Comparison between variables ............................................................................................. 18
Covariance ..................................................................................................................... 18
Correlation ...................................................................................................................... 18
Data projection .................................................................................................................. 19
Principal component analysis (PCA) ...................................................................................... 19
t-SNE ................................................................................................................................ 20
5. Supervised learning ........................................................................... 22
Linear classifier .................................................................................................................. 22
Binary classification ............................................................................................................ 22
Support vector machines (SVMs) ....................................................................................... 23
Classification overview ..................................................................................................... 23
Predictive accuracy ............................................................................................................. 23
Class labels ..................................................................................................................... 23
Thresholds and accuracy .................................................................................................. 24
Linear threshold ........................................................................................................... 24
ROC curve ................................................................................................................... 24
PR curve ...................................................................................................................... 24
ROC vs PR curves ............................................................................................................ 24
Nearest neighbour classifier ................................................................................................. 25
K-nearest neighbour (KNN) algorithm ................................................................................ 25
6. Regression ........................................................................................ 26
Simple linear regression ...................................................................................................... 26
Multiple linear regression ..................................................................................................... 26
Best fit & objective function ................................................................................................. 26
Non-linear regression .......................................................................................................... 27
Problems ........................................................................................................................... 27
Overfitting ...................................................................................................................... 27
Speed & scalability .......................................................................................................... 28
Interpretability ................................................................................................................ 28
Robustness ..................................................................................................................... 28
Regularized regression ........................................................................................................ 28
Elastic net ...................................................................................................................... 28
Common approach ............................................................................................................. 29
7. Machine learning methods ................................................................. 30
Classification ..................................................................................................................... 30
Algorithms ...................................................................................................................... 30
Decision tree ..................................................................................................................... 30
Choosing features ............................................................................................................ 30
Gini impurity ................................................................................................................... 30
Advantages .................................................................................................................. 31
Disadvantages .............................................................................................................. 31
Example Decision Tree ..................................................................................................... 31
Random forest ................................................................................................................... 31
Bootstrapping ................................................................................................................. 31
Bagging .......................................................................................................................... 32
Out-of-bag performance ................................................................................................ 32
Gini importance ............................................................................................................... 32
Example Random Forest ................................................................................................... 32
Neural networks & deep learning .......................................................................................... 32
Neurons ......................................................................................................................... 32
Neural network ................................................................................................................ 33
Perceptron ................................................................................................................... 33
Artificial Neural Networks ................................................................................................. 33
Deep learning .................................................................................................................... 34
Performance ................................................................................................................... 34
Google DeepMind ............................................................................................................ 34