Lecture notes

Cheat sheet for during exam

Institution
Technische Universiteit Eindhoven (TUE)

Sheet with notes for use during the exam. It may be used during the exam as long as it is printed.

Uploaded on 17 April 2022
Number of pages: 2
Written in 2021/2022
Type: Lecture notes
Lecturer(s): X
Contains: All lectures

data visualization
exploratory data analysis
data mining methods
hypothesis testing
data organisation and queries
data aggregation and sampling

Categorical data (no intrinsic value)
Nominal: outcomes that have no natural order (hair colour: blond, ..)
Dichotomous: nominal with 2 outcomes
Ordinal: outcomes that have natural order (ratings: bad, good, ..)

Numerical data (intrinsic value)
Continuous: any value on scale
Interval: equal intervals -> equal differences, no fixed 0-point (temperature C, IQ, time)
Ratio: differences and ratios make sense, fixed 0-point (budget, temperature K, distance)
Discrete: only certain values (number of ..)

Location of percentile P: $L_P = 1 + \frac{P}{100}(n - 1)$
Pth percentile value: $l + (L_P - \lfloor L_P \rfloor)(h - l)$

Lookup: know what and where
Browse: don't know what, know where
Locate: know what, not where
Explore: don't know what nor where

Key attribute = independent attribute
Value attribute = dependent attribute

Scatterplot: 2 quant. att., no keys, only values; points; horiz. + vert. position; find trends, outliers, distribution, correlation, clusters
Bar chart: 1 cat. att. (key) + 1 quant. att. (value); lines; length to express quant. value; spatial regions: one per mark; compare + look up values
Stacked bar chart: 2 cat. att., 1 quant. att.; vertical stack of line marks; glyph: composite object, internal structure from multiple marks; length and color hue; spatial regions: one per glyph; compare + look up values, part-to-whole relationship
Normalized stack. bar chart: same as stacked bar chart; reduces comparability for all cat. except lowest and highest
Line chart: 2 quant. att., 1 key, 1 value; points; aligned lengths to express quant. val.; separated + ordered by key att. into horizontal regions; find trend; connecting line emphasizes ordering of items along key axis by showing relationship between two items
Heatmap: 2 cat. att., 1 quant. att.; area; separate + align in 2D matrix, indexed by 2 cat. values; color by quant. att.; find clusters + outliers
Histogram: table; find distribution (shape); new table: keys are bins, values are counts; bin size crucial; related to kernel density estimate and rug plot
Boxplot: table; find distribution (group comparison); 5 quant. att.; median: central line; lower + upper quartiles: boxes; lower + upper fences: whiskers, at first quartile - 1.5 IQR and third quartile + 1.5 IQR; outliers beyond fence shown
Violin plot: same as boxplot, outliers are represented in density plot
Bar vs line chart: depends on type of key att.: bar if key = cat. (nominal), line if key = ordered; never line for categorical key: violates expressiveness principle + trend so strong it overrides semantics
Box vs violin plot: boxplots hide essential aspects of dataset, violin plots better for representing differences in distribution of data

When not visualizing: individual precise values matter, summary and detail values, scale is broken, decision needed in min time
Effectiveness: how well visualization helps person with their tasks

Location summary statistics: level
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: odd obs: middle value when ordered; even obs: avg of two middle values
Mode: most frequently occurring value

Scale statistics: spread
Range: max – min
IQR: 3rd quartile – 1st quartile
Sample var: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Sample sd: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
MAD: median of absolute deviations from the median

Sample covariance and correlation: relation
$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ and $r_{xy} = \frac{s_{xy}}{s_x s_y}$

Categories data mining:
Predefined target? Yes -> supervised method, No -> unsupervised method
Info applicable to all or some data? All -> global method, Some -> local method

DM methods:
Lin regression: supervised, global
Clustering: unsupervised, global
Decision tree: supervised, global
Association rule learning: unsupervised, local

Linear regression:
Consider residual $y - \hat{y}$ betw. real value $y$ and predicted value $\hat{y} = b_0 + b_1 x$.
SSD = $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Lower SSD -> better model
Best values of $b_0$ and $b_1$ are those that minimize SSD
Higher $R^2$ -> better model

Clustering:
Centroids represent clusters.
K-means clustering algorithm:
1. Pick k points as centroids
2. Assign points to nearest centroid
3. Recompute centroids: mean of points in cluster
4. Repeat steps 2 and 3
How well does centroid represent cluster: small distance -> good, large distance -> bad
-> strive for low within-cluster distance (W(C)): sum of distances between centroids and each observation
W(C) decreases when k increases
When is clustering informative:
K = 1: one cluster -> no
K = n: as many clusters as obs -> no
1 < K < n -> yes
! no rule for k, but not minimizing W(C)

Equal treatment of att. is important:
Use same units for similar attributes
Ensure units used lead to relevant distance for problem
Standardize units for dissimilar att.

Distance linear regression:
Residuals/deviations
Determines SSD, and so optimal model
Basis of quality measure $R^2$

Distance clustering:
Distances determine clusters
Different attribute scales can be chosen, influencing distance
Basis of quality measure W(C)

Distance: measure for how close things are, how related things are; distances can be easily compared; no single appropriate distance
Euclidean distance: as the crow flies
Network distance: know network of possible movements, network is sparse (not too many possible roads)
Manhattan distance: movement is restricted to fixed grid

Decision tree:
Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
Where to split: all-yes or all-no most informative, equal yes/no least informative; lowest avg entropy:
$H(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)$

Association rule learning:
Support of itemset X: $\mathrm{supp}(X) = \frac{|X|}{n}$
Support of itemset X ∩ Y: $\mathrm{supp}(X \cap Y) = \frac{|X \cap Y|}{n}$
Confidence of rule X => Y: $\mathrm{conf}(X \Rightarrow Y) = \frac{|X \cap Y|}{|X|}$

Object system: 'real' world of a company, organization…
Information system: representation of real world in a computer system using data to represent objects

Not storing all data in one table: duplication of information, difficulty keeping information consistent, difficulty accessing + sharing data, hard to keep data safe/secure, hard to express interesting analytics

Database management systems (DBMS's) provide solutions: data redundancy + inconsistency, data security, efficient data analytics

Primary key = unique identifier

Logical schema (data model) – logical structure of database:
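
The percentile location rule in the sheet can be tried out with a small sketch. It assumes that l and h denote the ordered values just below and just above the location (the sheet does not define them), and the function name `percentile` is only illustrative.

```python
import math

def percentile(values, p):
    """P-th percentile using the location 1 + (P/100)*(n-1) from the sheet."""
    ordered = sorted(values)
    n = len(ordered)
    pos = 1 + (p / 100) * (n - 1)      # 1-based location of the percentile
    lower = math.floor(pos)            # location rounded down
    frac = pos - lower                 # fractional part of the location
    l = ordered[lower - 1]             # assumed: value at the rounded-down location
    h = ordered[min(lower, n - 1)]     # assumed: next value (clamped at the end)
    return l + frac * (h - l)

print(percentile([1, 2, 3, 4], 25))
```

For P = 50 this reduces to the median rule in the sheet: the middle value for an odd number of observations, the average of the two middle values for an even number.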
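
The linear regression notes can be illustrated with a short sketch. The closed-form least-squares estimates used here (b1 = s_xy / s_x^2, b0 = mean(y) - b1 * mean(x)) and the definition R^2 = 1 - SSD / total squared deviation are the standard textbook ones and are an assumption, since the sheet's own formulas for the best values and for R^2 are not readable in the preview. All function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1*x (standard closed form, assumed)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_xx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def ssd(xs, ys, b0, b1):
    """Sum of squared deviations between real values y and predictions b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def r_squared(xs, ys, b0, b1):
    """R^2 as 1 - SSD / total squared deviation from the mean (assumed definition)."""
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ssd(xs, ys, b0, b1) / total

xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
b0, b1 = fit_line(xs, ys)
print(b0, b1, ssd(xs, ys, b0, b1), r_squared(xs, ys, b0, b1))
```

Any other choice of b0 and b1 on the same data gives a larger SSD, which matches the rule "lower SSD -> better model".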
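
The k-means steps and the quality measure W(C) from the clustering notes can be sketched as follows. This is a minimal version under some assumptions: Euclidean distance is used, the caller supplies the initial centroids (step 1), and W(C) is the plain sum of distances between each observation and its centroid, as worded in the sheet. It needs Python 3.8+ for `math.dist`.

```python
import math

def k_means(points, initial_centroids, max_iter=100):
    """Repeat assignment and centroid update until the centroids stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its points
        # (an empty cluster keeps its old centroid).
        updated = [
            tuple(sum(coords) / len(members) for coords in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:   # step 4: stop once nothing changes
            break
        centroids = updated
    return centroids, clusters

def within_cluster_distance(centroids, clusters):
    """W(C): sum of distances between each observation and its centroid."""
    return sum(math.dist(p, c) for c, members in zip(centroids, clusters)
               for p in members)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, clusters = k_means(points, [(0, 0), (10, 0)])
print(centroids, within_cluster_distance(centroids, clusters))
```

With k = n every point is its own centroid and W(C) is 0, which is why minimizing W(C) alone is not a rule for choosing k.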
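
The decision tree formulas (accuracy and the entropy H(p) used to choose a split) can be checked with a small sketch. Weighting each branch by its share of the observations when averaging entropy is an assumption, because the sheet only says "lowest avg entropy". The function names are illustrative.

```python
import math

def entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def average_entropy(branches):
    """Average entropy of a split; branches is a list of (yes_count, no_count).

    Each branch is weighted by its share of all observations (assumed weighting).
    """
    total = sum(yes + no for yes, no in branches)
    return sum((yes + no) / total * entropy(yes / (yes + no))
               for yes, no in branches if yes + no > 0)

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

print(average_entropy([(4, 0), (0, 4)]))   # all-yes / all-no branches
print(average_entropy([(2, 2), (2, 2)]))   # equal yes/no in both branches
```

The first split separates the classes perfectly and has the lowest possible average entropy; the second split is the least informative case.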
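
Support and confidence from the association rule notes can be computed as below. The sketch assumes that |X| counts the transactions containing every item of X, and that |X ∩ Y| counts the transactions containing all items of both X and Y. The transaction data is made up for illustration.

```python
def support(itemset, transactions):
    """supp(X): fraction of transactions that contain every item of X."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """conf(X => Y): transactions with both X and Y divided by transactions with X."""
    x_items, y_items = set(x), set(y)
    with_x = sum(x_items <= t for t in transactions)
    with_both = sum((x_items | y_items) <= t for t in transactions)
    return with_both / with_x

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
print(support({"bread"}, baskets))
print(confidence({"bread"}, {"milk"}, baskets))
```

Here bread appears in 3 of 4 baskets and 2 of those 3 also contain milk, so the rule bread => milk has confidence 2/3.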

Who am I buying this summary from?

Stuvia is a marketplace, so you are not buying this document from us but from seller jbtue. Stuvia facilitates the payment to the seller.
