Advanced data analysis
Contents
Chapter 1 - Introduction .................................................................................................................... 6
A bit of context .............................................................................................................................. 6
Introduction .............................................................................................................................. 6
Characteristics of big data .......................................................................................................... 6
But what is data ............................................................................................................................. 7
Objects and attributes ............................................................................................................... 7
Attribute types .......................................................................................................................... 8
Properties of attributes.............................................................................................................. 8
Discrete vs continuous attributes ............................................................................................... 8
Dataset types ................................................................................................................................ 9
Data mining ................................................................................................................................. 11
General.................................................................................................................................... 11
Is it data mining? ..................................................................................................................... 11
Data mining and statistics ........................................................................................................ 11
Data mining challenges ............................................................................................................ 12
Tasks ........................................................................................................................................... 13
General.................................................................................................................................... 13
Supervised ............................................................................................................................... 13
Unsupervised........................................................................................................................... 14
Data mining applications ............................................................................................................. 15
Overview ..................................................................................................................................... 15
Where are we with data mining now ........................................................................................... 15
Chapter 2 - Processing principles .................................................................................................... 16
Introduction ................................................................................................................................ 16
Unstructured data ................................................................................................................... 16
Common data processing steps ................................................................................................... 17
Overview ................................................................................................................................. 17
Feature extraction ................................................................................................................... 17
Attribute transformation ......................................................................................................... 17
Discretization........................................................................................................................... 18
Aggregation ............................................................................................................................. 18
Noise removal.......................................................................................................................... 18
Identifying outliers................................................................................................................... 19
Sampling .................................................................................................................................. 19
Handling duplicate data ........................................................................................................... 20
Handling missing values ........................................................................................................... 20
Dimensionality reduction ......................................................................................................... 21
Processing steps for specific data types ....................................................................................... 22
Image data............................................................................................................................... 22
Survey data.............................................................................................................................. 23
Sequence data ......................................................................................................................... 23
Text data ................................................................................................................................. 24
Omics data............................................................................................................................... 25
Chapter 3 - Data mining – Unsupervised clustering .......................................................................... 31
Unsupervised vs supervised ..................................................................................................... 31
Introduction ................................................................................................................................ 31
Clustering ................................................................................................................................ 31
Similarity ................................................................................................................................. 32
Dendrograms ............................................................................................................................ 34
Hierarchical clustering vs partitional clustering ........................................................................ 36
Hierarchical clustering ................................................................................................................. 36
General.................................................................................................................................... 36
Bottom-up ............................................................................................................................... 37
How do you calculate distance between already existing clusters ............................................ 37
Single linkage = nearest neighbour........................................................................................... 38
Complete linkage = furthest neighbour ................................................................................... 39
Group average ......................................................................................................................... 39
Ward’s method ........................................................................................................................ 40
Comparison ............................................................................................................................. 40
Partitional clustering ................................................................................................................... 41
General.................................................................................................................................... 41
How many clusters?................................................................................................................. 41
How to tell right number of clusters? ....................................................................................... 41
Objective function: squared error ............................................................................................ 42
k-means steps.......................................................................................................................... 42
Importance of choosing initial centroids .................................................................................. 44
k-means limitations ................................................................................................................. 44
k-means: conclusion ................................................................................................................ 45
Chapter 4 - Principal component analysis ........................................................................................ 46
Introduction ................................................................................................................................ 46
Principal component analysis ................................................................................................... 46
Multivariate data ..................................................................................................................... 46
Basic variable statistics ............................................................................................................ 46
Data transformation ................................................................................................................ 47
Comparison between variables ................................................................................................ 48
Still too many variables ............................................................................................................ 50
Data projection ........................................................................................................................ 50
PCA - Theory ................................................................................................................................ 51
Introduction ............................................................................................................................ 51
How PCA works........................................................................................................................ 51
PCA output .............................................................................................................................. 52
PCA summary .......................................................................................................................... 53
PCA usage ................................................................................................................................ 53
How many PCs are enough to cover a data set? ........................................................................... 53
PCA - examples ............................................................................................................................ 54
Possum dataset ....................................................................................................................... 54
Nutrition dataset ..................................................................................................................... 56
B-cell receptor sequencing ....................................................................................................... 59
Metagenomics data ................................................................................................................. 60
t-SNE ........................................................................................................................................... 62
What is t-SNE? ......................................................................................................................... 62
How does t-SNE work?............................................................................................................. 62
PCA vs t-SNE ............................................................................................................................ 63
Perplexity ................................................................................................................................ 63
t-SNE for single cell RNAseq ..................................................................................................... 63
Chapter 5 - Supervised learning ....................................................................................................... 64
Classification problem ................................................................................................................. 64
Cat or dog problem .................................................................................................................. 64
Pigeon problem ....................................................................................................................... 64
Grasshopper problem .............................................................................................................. 64
Regression vs classification ...................................................................................................... 66
Linear classifier ............................................................................................................................ 66
Grasshopper example .............................................................................................................. 67
Decision boundary ................................................................................................................... 67
Examples ................................................................................................................................. 68
Iris dataset ............................................................................................................................... 69
Support vector machine........................................................................................................... 69
Decision value.......................................................................................................................... 70
Classifier overview ................................................................................................................... 71
Estimating the performance of the classifier ................................................................................ 71
Predictive accuracy .................................................................................................................. 71
Class labels .............................................................................................................................. 72
Confusion matrix ..................................................................................................................... 72
Type I error vs type II error ...................................................................................................... 73
Values that can be acquired from confusion matrix ................................................................. 73
Thresholds and accuracy .......................................................................................................... 73
ROC-curve ............................................................................................................................... 75
PR curve – precision recall curve .............................................................................................. 76
ROC vs PR curves ..................................................................................................................... 76
Nearest Neighbour Classifier........................................................................................................ 77
Chapter 6 - Regression..................................................................................................................... 79
Introduction ................................................................................................................................ 79
Introductory example .............................................................................................................. 79
Classification vs regression....................................................................................................... 79
Simple linear regression............................................................................................................... 80
General.................................................................................................................................... 80
Multiple linear regression ............................................................................................................ 80
General.................................................................................................................................... 80
Best fit ..................................................................................................................................... 81
Objective function ................................................................................................................... 81
Evaluation................................................................................................................................ 82
Non-linear regression .................................................................................................................. 83
Logistic regression ................................................................................................................... 83
Overfitting ................................................................................................................................... 83
How do we estimate the capacity of our model to overfit? ...................................................... 84
K-fold cross validation – how do we estimate the accuracy of our model? ............................... 84
Factors to consider when building a model .................................................................................. 85
Speed and scalability ............................................................................................................... 85
Interpretability ........................................................................................................................ 86