Summary of the theory course Advanced Data Analysis


Course
Advanced Data Analysis

Institution
Universiteit Antwerpen (UA)

Summary based on the PowerPoint slides and own notes


Uploaded on 6 July 2023
Number of pages 54
Written in 2022/2023
Type Summary

Uploaded by lauraheyndrickx


Contents
What is data
Data mining
Tasks
Common data processing steps
Processing steps for specific data types
Meta-genomics
Introduction
Binning
Downstream analysis
Introduction
Clustering
Similarity
Dendrograms
Hierarchical clustering
Single linkage, complete linkage, group average, centroid, Ward's method
Partitional clustering
k-means
Multivariate data
Data transformation
Normalization
Comparison between variables
Data projection
Working mechanism of PCA
PCA loadings
PCA scores
Scree plot
Example case
Possum dataset
Nutrition data
t-SNE
Linear classifier
Multiclass classification problem
Confusion matrix
Nearest neighbor classifier
Linear regression
Simple linear regression
Multiple linear regression
Non-linear regression
Logistic regression
Cox regression
K-fold cross validation
Feature selection
Regularized regression
Decision tree and random forests
Random forest
Neural networks
Deep learning
Examples

Introduction
What is data
• AI has gone through hypes before, but the recent one is very important
  o Something is happening that will transform the way we work
• Data = collection of objects (the items that you study) and their attributes
  o Attribute = variable = feature = property or characteristic of an object
    ▪ E.g. eye color, temperature…
    ▪ A collection of attributes describes an object
    ▪ Attribute values are numbers or symbols assigned to an attribute
      • The same attribute can be mapped to different attribute values (e.g. height can be measured in feet or meters)
      • Different attributes can be mapped to the same set of values
        o However, the properties of the attribute values can still be different
    ▪ Different types are possible
      • Nominal (zip codes, eye color…; operations =, ≠), ordinal (rankings; <, >), interval (calendar dates, temperatures in Celsius; +, -) and ratio (*, /)
      • Discrete (finite or countably infinite set of values) or continuous (real numbers -> in practice only a finite number of digits can be recorded)
• Big data = data for which conventional computing techniques are no longer sufficient because of its size, complexity…
  o Even though computers keep getting more powerful, conventional ways of building a program to solve the problem will not be sufficient
    ▪ A lot of things that worked for small data no longer work on larger datasets
  o Characterized by the four V's
    ▪ Volume: drastically increasing volume of data
      • E.g. the cost of reading out a full human genome has decreased drastically since new sequencing techniques and computer programs were developed
        o Now 1,000 dollars for a whole genome instead of 100,000,000 dollars -> within reach for medical applications
        o Also important: patents have expired -> the prediction is that reading a genome will get even cheaper
    ▪ Velocity: the speed at which data is received
      • While you are still waiting on previous data, more data is already coming towards you -> instant solutions needed
        o Smartphones and smartwatches are measuring all the time
          ▪ Not too bad to wait longer for certain information
        o In a hospital environment we need our data immediately
      • Sometimes the data comes in so fast that there are not enough people to analyse it = data management gap
      • Need for a new, effective, high-tech data transfer approach
        o The amount of data that has to be transferred is so large that it is more efficient to put it on hard drives and bring it by bike to other places all over the world
          ▪ The amount of wifi needed could be so high that in certain places in the world it isn't a possibility
    ▪ Variety: heterogeneity of data
      • No single type of data, but very diverse, and a lot of background information is needed to understand the data
        o A lot of data is unstructured (not organized in matrices) -> context information is very important to make sense of this amount of data
    ▪ Veracity: how trustworthy is all the data?
• Because of large-scale data and AI, a new data-intensive research paradigm has formed
• Different dataset types
  o Record = data consisting of a collection of records, each of which consists of a fixed set of attributes
    ▪ If data objects have the same fixed set of numeric attributes, then the data can be thought of as points in a multi-dimensional space
      • Every dimension represents a distinct attribute
      • Represented by an m by n matrix
        o The m rows are objects and the n columns are attributes
        o The number of dimensions depends on the number of attributes
    ▪ Document data: each document becomes a 'term' vector
      • Each term is a component (attribute) of the vector
        o The value of each component is the number of times the corresponding term occurs in the document -> classification of documents is possible
    ▪ Transaction data as a special type of record data
      • Each record (transaction) involves a set of items
        o E.g. supermarket basket: each record consists of a set of items (zero to many), and every record has a different number and combination of products
  o Graph = way to represent complex information in a network
  o Ordered = data that follows a specific order, where the order is essential for the information
    ▪ E.g. DNA sequence: not only the number of letters is important, but also the positions where the letters appear
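
The record-type datasets described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the two example documents, the baskets and the names `term_vector`, `matrix` and `baskets` are made up here and do not come from the course material.

```python
from collections import Counter

# Toy documents; the texts are invented for illustration only.
documents = [
    "the gene regulates the protein",
    "the protein binds the receptor",
]

# Shared vocabulary: one vector component (attribute) per distinct term.
vocabulary = sorted({term for doc in documents for term in doc.split()})


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]


# m-by-n record matrix: m rows (documents/objects), n columns (terms/attributes).
matrix = [term_vector(doc, vocabulary) for doc in documents]

# Transaction data: each record is a set of items, and the number and
# combination of items differs per record (an empty basket is allowed).
baskets = [{"bread", "milk"}, {"bread", "beer", "eggs"}, set()]
basket_sizes = [len(basket) for basket in baskets]

for row in matrix:
    print(row)
print(basket_sizes)
```

Every row of `matrix` has the same length (the size of the vocabulary), which is what makes it record data with a fixed set of attributes. The baskets, in contrast, have no fixed length, which is why transaction data is treated as a special case.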

Data mining¹
• Non-trivial extraction of implicit, previously unknown and potentially useful information from a dataset
  o Automatic or semi-automatic means can discover meaningful patterns in large quantities of data -> converting extracted information into useful knowledge
• The data is already there, but an explicit answer to a question isn't
  o E.g. which names are more prevalent in certain locations, personalized book recommendations, prioritizing genes based on text co-occurrences
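
As a minimal sketch of the "names per location" example above: the records below are invented, and the helper name `most_common_name_per_location` is an assumption for illustration. The point is that the answer is not stored anywhere explicitly; it has to be extracted from the data.

```python
from collections import Counter, defaultdict

# Invented (name, location) records. The data is already there, but the
# answer to "which name is most common where?" is only implicit in it.
records = [
    ("Emma", "Antwerp"), ("Noah", "Antwerp"), ("Emma", "Antwerp"),
    ("Liam", "Ghent"), ("Noah", "Ghent"), ("Liam", "Ghent"),
]


def most_common_name_per_location(records):
    """Return, for every location, the name that occurs most often there."""
    counts = defaultdict(Counter)
    for name, location in records:
        counts[location][name] += 1
    return {loc: names.most_common(1)[0][0] for loc, names in counts.items()}


top_names = most_common_name_per_location(records)
print(top_names)
```

Real data mining works on far larger and messier datasets, but the principle is the same: simple counting already turns raw records into a pattern that nobody wrote down explicitly.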

¹ Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis
