
Cheat sheet for during exam


Sheet with notes for use during the exam. It may be used during the exam as long as it is printed.

Preview: 1 of 2 pages

  • 17 April 2022
  • 2 pages
  • 2021/2022
  • Lecture notes
  • X
  • All lectures
jbtue
Categorical data (no intrinsic value)
Nominal: outcomes that have no natural order (hair colour: blond, ..)
Dichotomous: nominal with 2 outcomes
Ordinal: outcomes that have a natural order (ratings: bad, good, ..)

Numerical data (intrinsic value)
Continuous: any value on a scale
Interval: equal intervals -> equal differences, no fixed 0-point (temperature C, IQ, time)
Ratio: differences and ratios make sense, fixed 0-point (budget, temperature K, distance)
Discrete: only certain values (number of ..)

Location of percentile P: $L_P = 1 + \frac{P}{100}(n - 1)$
Pth percentile value $= l + (L_P - \lfloor L_P \rfloor)(h - l)$, with l and h the ordered values just below and above position $L_P$

When not visualizing: individual precise values matter, summary and detail values, scale is broken, decision needed in minimal time
Effectiveness: how well the visualization helps a person with their tasks

Location summary statistics: level
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: odd number of obs: middle value when ordered; even number of obs: avg of the two middle values
Mode: most frequently occurring value

Scale statistics: spread
Range: max - min
IQR: 3rd quartile - 1st quartile

Clustering quality: strive for low within-cluster distance W(C): sum of distances between each observation and its centroid
W(C) decreases when k increases
When is clustering informative:
K = 1: one cluster -> no
K = n: as many clusters as obs -> no
1 < K < n -> yes
! no rule for k, but not minimizing W(C)

Equal treatment of attributes is important:
Use same units for similar attributes
Ensure the units used lead to a relevant distance for the problem
Standardize units for dissimilar attributes

Distance in linear regression:
Residuals/deviations
Determines SSD, and so the optimal model
Basis of quality measure $R^2$
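The role of residuals in linear regression can be made concrete with a small example. This is a minimal sketch, assuming a straight-line model ŷ = b0 + b1·x fitted by ordinary least squares. The closed-form estimates and the definition R^2 = 1 - SSD / (total squared deviation of y from its mean) are standard textbook results, not formulas taken from this sheet, and the function names and data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for y ~ b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def ssd(xs, ys, b0, b1):
    """Sum of squared residuals y - y_hat (lower means a better fit)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def r_squared(xs, ys, b0, b1):
    """Assumed definition: 1 - SSD / total squared deviation of y."""
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ssd(xs, ys, b0, b1) / total


# Made-up observations that lie roughly on a line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
print("b0, b1:", b0, b1)
print("SSD:", ssd(xs, ys, b0, b1))
print("R^2:", r_squared(xs, ys, b0, b1))
```

Shifting b0 or b1 away from the fitted values can only increase SSD, which is why the residuals determine the optimal model.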
Lookup: know what and where
Browse: don't know what, know where
Locate: know what, not where
Explore: don't know what nor where

Distance in clustering:
Distances determine clusters
Different attribute scales can be chosen, influencing distance
Basis of quality measure W(C)

Sample var: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Sample sd: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
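The location and scale statistics on this sheet can be checked quickly in Python. The sketch below uses the standard `statistics` module for mean, median, mode, sample variance and sample sd, plus a hand-written `percentile` helper that follows the sheet's location rule L_P = 1 + (P/100)(n - 1). The helper name and the data are made up, and reading l and h as the ordered values just below and above the location is an assumption.

```python
import math
import statistics


def percentile(values, p):
    """P-th percentile via the location rule L_P = 1 + (P/100)(n - 1)."""
    ordered = sorted(values)
    n = len(ordered)
    location = 1 + (p / 100) * (n - 1)   # 1-based position in the ordered data
    lower = math.floor(location)          # position of the value just below
    low = ordered[lower - 1]
    high = ordered[min(lower, n - 1)]     # value just above (clamped at the end)
    return low + (location - lower) * (high - low)


data = [2, 4, 4, 5, 7, 9, 11]  # made-up observations

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "range": max(data) - min(data),
    "iqr": percentile(data, 75) - percentile(data, 25),
    "variance": statistics.variance(data),  # divides by n - 1
    "sd": statistics.stdev(data),           # square root of the sample variance
}
print(summary)
```

For this data the mean is 6, the median 5, the mode 4, the range 9, the IQR 4 and the sample variance 10.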

Key attribute = independent attribute
Value attribute = dependent attribute

Scatterplot: 2 quant. att., no keys only values, points, horiz. + vert. position, find trends, outliers, distribution, correlation, clusters
Bar chart: 1 cat. att. (key) + 1 quant. att. (value), lines, length to express quant. value, spatial regions: one per mark, compare + look up values

MAD: median of absolute deviations from the median
Sample covariance and correlation (relation): $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ and $r_{xy} = \frac{s_{xy}}{s_x s_y}$

Distance: measure for how close things are, how related things are; distances can be easily compared; no single appropriate distance
Euclidean distance: as the crow flies
Network distance: know network of possible movements, network is sparse (not too many possible roads)
Manhattan distance: movement is restricted to a fixed grid
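The covariance, correlation, MAD and distance definitions above translate almost directly into code. This is a minimal sketch in plain Python. All function names and data are made up for illustration, and network distance is left out because it needs a road network that the sheet does not specify.

```python
import math
import statistics


def sample_covariance(xs, ys):
    """s_xy: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y); s_x is the square root of s_xx."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


def mad(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def euclidean(p, q):
    """Straight-line distance, as the crow flies."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Distance when movement is restricted to a fixed grid."""
    return sum(abs(a - b) for a, b in zip(p, q))


xs = [1, 2, 3, 4]   # made-up data with a perfect linear relation
ys = [2, 4, 6, 8]
print(sample_covariance(xs, ys), sample_correlation(xs, ys))
print(mad([2, 4, 4, 5, 7, 9, 11]))
print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
```

For the points (0, 0) and (3, 4) the Euclidean distance is 5 and the Manhattan distance is 7, which shows that the choice of distance changes how close two observations appear.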
Stacked bar chart: 2 cat. att., 1 quant. att., vertical stack of line marks, glyph: composite object, internal structure from multiple marks, length and color hue, spatial regions: one per glyph, compare + look up values, part-to-whole relationship
Normalized stacked bar chart: same as stacked bar chart, reduces comparability for all cat. except lowest and highest
Line chart: 2 quant. att., 1 key, 1 value, points, aligned lengths to express quant. val., separated + ordered by key att. into horizontal regions, find trend, connecting line emphasizes ordering of items along key axis by showing relationship between two items
Heatmap: 2 cat. att., 1 quant. att., area, separate + align in 2D matrix, indexed by 2 cat. values, color by quant. att., find clusters + outliers

Categories of data mining:
Predefined target? Yes -> supervised method, No -> unsupervised method
Info applicable to all or some data? All -> global method, Some -> local method

DM methods:
Lin regression: supervised, global
Clustering: unsupervised, global
Decision tree: supervised, global
Association rule learning: unsupervised, local

Linear regression:
Consider residual y - ŷ between real value y and predicted value ŷ = b0 + b1x.
$SSD = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Lower SSD -> better model

Decision tree:
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
Where to split: all-yes or all-no most informative, equal yes/no least informative; lowest avg entropy:
$H(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)$

Association rule learning:
Support of itemset X: $supp(X) = \frac{|X|}{n}$
Support of itemset X ∩ Y: $supp(X \cap Y) = \frac{|X \cap Y|}{n}$
Confidence of rule X => Y: $conf(X \Rightarrow Y) = \frac{|X \cap Y|}{|X|}$
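The decision-tree and association-rule formulas above are short enough to try out directly. The sketch below is a minimal illustration with made-up transactions and hypothetical function names. It assumes |X| means the number of transactions containing every item of X, so the sheet's |X ∩ Y| (transactions containing both X and Y) is computed from the combined itemset `x | y`.

```python
import math


def entropy(p):
    """H(p) for a yes/no split; 0 when the split is all-yes or all-no."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """conf(X => Y): transactions with X and Y, relative to those with X."""
    return support(x | y, transactions) / support(x, transactions)


# Made-up shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

print(entropy(0.5), entropy(1.0))
print(accuracy(tp=40, tn=45, fp=5, fn=10))
print(support({"bread"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

An even yes/no split gives entropy 1 (least informative) and a pure split gives 0 (most informative), matching the splitting rule on the sheet.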
Histogram: table, find distribution (shape), new table: keys are bins, values are counts, bin size crucial, related to kernel density estimate and rug plot
Boxplot: table, find distribution (group comparison), 5 quant. att., median: central line, lower + upper quartiles: boxes, lower + upper fences: whiskers, first quartile - 1.5 IQR, third quartile + 1.5 IQR, outliers beyond fence shown
Violin plot: same as boxplot, outliers are represented in density plot
Bar vs line chart: depends on type of key att.: bar if key = cat. (nominal), line if key = ordered, never line for categorical key: violates expressiveness principle + trend so strong it overrides semantics
Box vs violin plot: boxplots hide essential aspects of dataset, violin plots better for representing differences in distribution of data

Object system: 'real' world of a company, organization, ...
Information system: representation of real world in a computer system using data to represent objects
Not storing all data in one table: duplication of information, difficulty keeping information consistent, difficulty accessing + sharing data, hard to keep data safe/secure, hard to express interesting analytics
Database management systems (DBMSs) provide solutions: data redundancy + inconsistency, data security, efficient data analytics
Primary key = unique identifier
Logical schema (data model): logical structure of database

Linear regression, best values of b0 and b1: [formula lost in the preview]
Higher $R^2$ -> better model

Clustering:
Centroids represent clusters.
K-means clustering algorithm:
1. Pick k points as centroids
2. Assign points to nearest centroid
3. Recompute centroids: mean of points in cluster
4. Repeat steps 2 and 3
How well does centroid represent cluster: small distance -> good, large distance -> bad
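The k-means steps listed on this sheet (pick k centroids, assign points to the nearest centroid, recompute centroids as means, repeat) can be sketched in a few lines. This is an illustrative version, not a reference implementation. It assumes Euclidean distance, random initial centroids drawn from the data, a fixed seed, and a stop as soon as the centroids no longer change. The function names and the points are made up.

```python
import math
import random


def assign(points, centroids):
    """Step 2: put every point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        clusters[nearest].append(p)
    return clusters


def mean_point(cluster):
    """Step 3: coordinate-wise mean of the points in one cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k points as centroids
    for _ in range(max_iter):          # step 4: repeat steps 2 and 3
        clusters = assign(points, centroids)
        updated = [mean_point(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:       # assumption: stop once nothing moves
            break
        centroids = updated
    return centroids, assign(points, centroids)


def within_cluster_distance(centroids, clusters):
    """W(C): sum of distances between each point and its centroid."""
    return sum(math.dist(p, c) for c, cluster in zip(centroids, clusters) for p in cluster)


# Made-up data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

for k in (1, 2, len(points)):
    centroids, clusters = k_means(points, k)
    print(k, within_cluster_distance(centroids, clusters))
```

Running it for k = 1, k = 2 and k = n shows W(C) shrinking as k grows and reaching 0 when every point is its own cluster, which is why minimizing W(C) alone cannot be used to choose k.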
