Categorical data (no intrinsic value)
Nominal: outcomes that have no natural order (hair colour: blond, ...)
Dichotomous: nominal with 2 outcomes
Ordinal: outcomes that have a natural order (ratings: bad, good, ...)
Numerical data (intrinsic value)
Continuous: any value on a scale
Interval: equal intervals -> equal differences, no fixed 0-point (temperature in °C, IQ, time)
Ratio: differences and ratios make sense, fixed 0-point (budget, temperature in K, distance)
Discrete: only certain values (number of ...)
Location summary statistics: level
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: odd number of observations: middle value when ordered; even number: average of the two middle values
Mode: most frequently occurring value
Location of the Pth percentile: $L_P = 1 + (P/100)(n-1)$
Pth percentile value: $l + (L_P - \lfloor L_P \rfloor)(h - l)$, where $l$ and $h$ are the ordered observations at positions $\lfloor L_P \rfloor$ and $\lfloor L_P \rfloor + 1$
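A minimal sketch of the percentile rule above in plain Python (the function name percentile_value is mine):

```python
import math

def percentile_value(data, p):
    """Pth percentile by linear interpolation: L_P = 1 + (P/100)(n-1)."""
    xs = sorted(data)
    n = len(xs)
    loc = 1 + (p / 100) * (n - 1)          # 1-based location of the percentile
    lo = math.floor(loc)
    l, h = xs[lo - 1], xs[min(lo, n - 1)]  # ordered values around the location
    return l + (loc - lo) * (h - l)        # interpolate between l and h

print(percentile_value([15, 20, 35, 40, 50], 25))  # 20.0
```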
Scale statistics: spread
Range: max - min
IQR: 3rd quartile - 1st quartile
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
MAD: median of the absolute deviations from the median
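A quick check of these spread measures using only the standard library (the data values are made up):

```python
import statistics

xs = [2, 4, 4, 4, 5, 5, 7, 9]

rng = max(xs) - min(xs)                              # range: max - min
var = statistics.variance(xs)                        # sample variance, 1/(n-1)
sd = statistics.stdev(xs)                            # sample standard deviation
med = statistics.median(xs)
mad = statistics.median([abs(x - med) for x in xs])  # median absolute deviation

print(rng, var, sd, mad)  # 7 4.571... 2.138... 0.5
```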
Sample covariance and correlation:
$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
$r_{xy} = \frac{s_{xy}}{s_x s_y}$
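A small sketch of $s_{xy}$ and $r_{xy}$ in plain Python (function names are mine; data made up):

```python
def covariance(xs, ys):
    """Sample covariance s_xy with the 1/(n-1) convention."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation r_xy = s_xy / (s_x * s_y)."""
    sx = covariance(xs, xs) ** 0.5   # s_x is the sample sd of x
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)

print(correlation([1, 2, 3, 4], [2, 4, 5, 8]))  # ≈ 0.98
```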
When not to visualize: individual precise values matter, summary and detail values, scale is broken, decision needed in minimal time
Effectiveness: how well a visualization helps a person with their tasks

Visualization tasks:
Lookup: know what and where
Browse: don't know what, know where
Locate: know what, not where
Explore: don't know what nor where

Key attribute = independent attribute
Value attribute = dependent attribute
Scatterplot: 2 quant. att., no keys, only values; points; horizontal + vertical position; find trends, outliers, distribution, correlation, clusters
Bar chart: 1 cat. att. (key) + 1 quant. att. (value); line marks; length to express the quant. value; spatial regions: one per mark; compare + look up values
Stacked bar chart: 2 cat. att. + 1 quant. att.; vertical stack of line marks; glyph: composite object with internal structure from multiple marks; length and colour hue; spatial regions: one per glyph; compare + look up values, part-to-whole relationship
Normalized stacked bar chart: same as stacked bar chart, but reduces comparability for all categories except the lowest and highest
Line chart: 2 quant. att., 1 key, 1 value; points; aligned lengths to express the quant. value; separated + ordered by the key att. into horizontal regions; find trend; the connecting line emphasizes the ordering of items along the key axis by showing the relationship between two items
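A minimal matplotlib sketch of the bar/line pairing described above (all data is made up for illustration):

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Bar chart: categorical key -> one line mark per category, length = value.
ax1.bar(["blond", "brown", "black"], [12, 30, 25])
ax1.set_title("bar: categorical key")

# Line chart: ordered key -> connecting line emphasizes the trend.
ax2.plot([2019, 2020, 2021, 2022], [3.1, 4.0, 4.8, 5.5], marker="o")
ax2.set_title("line: ordered key")

plt.show()
```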
Heatmap: 2 cat. att. + 1 quant. att.; area marks; separate + align in a 2D matrix indexed by the 2 cat. values; colour by the quant. att.; find clusters + outliers
Histogram: table; find distribution (shape); new table: keys are bins, values are counts; bin size is crucial; related to kernel density estimate and rug plot
Boxplot: table; find distribution (group comparison); 5 quant. att.; median: central line; lower + upper quartiles: boxes; lower + upper fences: whiskers at first quartile - 1.5·IQR and third quartile + 1.5·IQR; outliers beyond the fences are shown
Violin plot: same as boxplot, but outliers are represented in a density plot
Bar vs line chart: depends on the type of the key att.: bar if key = categorical (nominal), line if key = ordered; never use a line for a categorical key: it violates the expressiveness principle, and the trend is so strong it overrides semantics
Box vs violin plot: boxplots hide essential aspects of the dataset; violin plots are better for representing differences in the distribution of data
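A sketch of why bin size is crucial for histograms, assuming a made-up bimodal sample:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 0.5, 200)])

# Same data, two bin sizes: too few bins hide the second mode.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(data, bins=5)
axes[0].set_title("5 bins: shape hidden")
axes[1].hist(data, bins=40)
axes[1].set_title("40 bins: two modes visible")
plt.show()
```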
Categories of data mining methods:
Predefined target? Yes -> supervised method; No -> unsupervised method
Extracted info applicable to all or some of the data? All -> global method; Some -> local method

DM methods:
Linear regression: supervised, global
Clustering: unsupervised, global
Decision tree: supervised, global
Association rule learning: unsupervised, local
Linear regression:
Consider the residual $y - \hat{y}$ between the real value $y$ and the predicted value $\hat{y} = b_0 + b_1 x$.
$SSD = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$; lower SSD -> better model.
Best values (least squares): $b_1 = s_{xy} / s_x^2$, $b_0 = \bar{y} - b_1 \bar{x}$.
Higher $R^2$ -> better model.
Distance in linear regression: the residuals/deviations are the distances; they determine the SSD, and so the optimal model; basis of the quality measure $R^2$.
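A minimal least-squares sketch following the formulas above; $R^2$ is computed with the standard definition $R^2 = 1 - SSD/SST$, which the notes only name, and the data is made up:

```python
def fit_line(xs, ys):
    """Least squares: b1 = s_xy / s_x^2, b0 = ybar - b1 * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx                 # the 1/(n-1) factors cancel in the ratio
    b0 = ybar - b1 * xbar
    return b0, b1

xs, ys = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8]
b0, b1 = fit_line(xs, ys)
ybar = sum(ys) / len(ys)
ssd = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # lower -> better
sst = sum((y - ybar) ** 2 for y in ys)
r2 = 1 - ssd / sst                                           # higher -> better
print(b0, b1, ssd, r2)
```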
Clustering:
Centroids represent clusters.
K-means clustering algorithm:
1. Pick k points as centroids
2. Assign each point to the nearest centroid
3. Recompute the centroids as the mean of the points in each cluster
4. Repeat steps 2 and 3
How well does a centroid represent its cluster: small distance -> good, large distance -> bad -> strive for a low within-cluster distance $W(C)$: the sum of the distances between the centroids and each observation.
W(C) decreases when k increases.
When is clustering informative:
k = 1: one cluster -> no
k = n: as many clusters as observations -> no
1 < k < n -> yes
There is no rule for choosing k, but it is not found by minimizing W(C).
Distance clustering: the distances determine the clusters; different attribute scales can be chosen, influencing the distances; basis of the quality measure W(C).
Equal treatment of attributes is important: use the same units for similar attributes, ensure the units used lead to a distance that is relevant for the problem, standardize units for dissimilar attributes.
Distance: a measure of how close/related things are; distances can be easily compared; there is no single appropriate distance.
Euclidean distance: as the crow flies
Network distance: know the network of possible movements; the network is sparse (not too many possible roads)
Manhattan distance: movement is restricted to a fixed grid
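A plain-Python sketch of the four k-means steps plus W(C); function names and data are mine, with Euclidean distance via math.dist:

```python
import math, random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means with Euclidean distance; returns centroids and labels."""
    random.seed(seed)
    centroids = random.sample(points, k)            # step 1: pick k points
    for _ in range(iters):
        # step 2: assign each point to the nearest centroid
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # step 3: recompute each centroid as the mean of its points
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, labels

def within_cluster_distance(points, centroids, labels):
    """W(C): sum of distances between each observation and its centroid."""
    return sum(math.dist(p, centroids[l]) for p, l in zip(points, labels))

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
cents, labs = kmeans(pts, k=2)
print(cents, within_cluster_distance(pts, cents, labs))
```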
Decision tree:
Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
Where to split: an all-yes or all-no outcome is most informative, an equal yes/no split is least informative; pick the split with the lowest average entropy:
$H(p) = -p \log_2(p) - (1-p) \log_2(1-p)$
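A small sketch of the entropy-based split rule; avg_entropy and its (yes, no) input format are my own illustration:

```python
import math

def entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p); 0 for pure (all-yes/all-no) nodes."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def avg_entropy(splits):
    """Weighted average entropy of a candidate split.
    splits: list of (n_yes, n_no) pairs, one per branch."""
    total = sum(y + n for y, n in splits)
    return sum((y + n) / total * entropy(y / (y + n)) for y, n in splits)

print(entropy(0.5))                   # 1.0: equal yes/no, least informative
print(entropy(1.0))                   # 0.0: all-yes, most informative
print(avg_entropy([(9, 1), (2, 8)]))  # low avg entropy -> good split
```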
Association rule learning:
Support of itemset X: $supp(X) = |X| / n$
Support of itemset X ∩ Y: $supp(X \cap Y) = |X \cap Y| / n$
Confidence of rule X => Y: $conf(X \Rightarrow Y) = |X \cap Y| / |X|$
(|X| = number of transactions containing X, n = total number of transactions)
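A sketch of support and confidence over a made-up list of transactions (sets of items):

```python
def supp(transactions, items):
    """Support: fraction of transactions containing all given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def conf(transactions, x, y):
    """Confidence of X => Y: supp(X and Y) / supp(X)."""
    return supp(transactions, set(x) | set(y)) / supp(transactions, x)

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(supp(baskets, {"bread"}))            # 0.75
print(conf(baskets, {"bread"}, {"milk"}))  # 2/3 of bread-buyers also buy milk
```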
Object system: the 'real' world of a company, organization, ...
Information system: a representation of the real world in a computer system, using data to represent objects.

Why not store all data in one table: duplication of information, difficulty keeping information consistent, difficulty accessing + sharing data, hard to keep data safe/secure, hard to express interesting analytics.

Database management systems (DBMSs) provide solutions: data redundancy + inconsistency, data security, efficient data analytics.

Primary key = unique identifier
Logical schema (data model) - the logical structure of the database:
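A minimal sqlite3 sketch of the split-tables idea; the table and column names are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Two tables instead of one wide table: customer data lives in one place,
# so no duplication and no inconsistency when a customer changes address.
cur.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key = unique identifier
    name        TEXT,
    city        TEXT)""")
cur.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount      REAL)""")

cur.execute("INSERT INTO customers VALUES (1, 'Alice', 'Utrecht')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 19.99), (11, 1, 5.50)])

# Analytics stay easy to express: join the tables on the key.
for row in cur.execute("""SELECT c.name, SUM(o.amount)
                          FROM customers c JOIN orders o
                          ON c.customer_id = o.customer_id
                          GROUP BY c.customer_id"""):
    print(row)  # ('Alice', 25.49)
```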