Categorical data (no intrinsic value)
Nominal: outcomes that have no natural order (hair colour: blond, ...)
Dichotomous: nominal with 2 outcomes
Ordinal: outcomes that have a natural order (ratings: bad, good, ...)

When not to visualize: individual precise values matter, summary and detail values, scale is broken, decision needed in minimal time
Effectiveness: how well the visualization helps a person with their tasks

Strive for low within-cluster distance W(C): sum of distances between centroids and each observation
W(C) decreases when k increases
When is clustering informative:
K = 1: one cluster -> no
K = n: as many clusters as observations -> no
1 < K < n -> yes
No fixed rule for choosing k, but not by minimizing W(C)
Equal treatment of attributes is important:
Use same units for similar attributes
Ensure units used lead to relevant distance for the problem
Standardize units for dissimilar attributes
Distance in linear regression:
Residuals/deviations
Determines SSD, and so the optimal model
Basis of quality measure R²

Numerical data (intrinsic value)
Continuous: any value on scale
Interval: equal intervals -> equal differences, no fixed 0-point (temperature C, IQ, time)
Ratio: differences and ratios make sense, fixed 0-point (budget, temperature K, distance)
Discrete: only certain values (number of..)
Location of P-th percentile: $L_P = 1 + \frac{P}{100}(n-1)$
P-th percentile value $= l + (L_P - \lfloor L_P \rfloor)(h - l)$, with l and h the ordered values just below and just above position $L_P$

Location summary statistics: level
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: odd number of obs: middle value when ordered; even number of obs: avg of the two middle values
Mode: most frequently occurring value
Scale statistics: spread
Range: max - min
IQR: 3rd quartile - 1st quartile
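The location statistics and the percentile location formula above can be tried out with a minimal sketch like the one below. The function name `percentile` and the data are made up for illustration; l and h are taken to be the ordered values at positions L_P rounded down and rounded up.

```python
import math
from statistics import mean, median, mode

def percentile(values, p):
    """Value of the P-th percentile using L_P = 1 + (P/100)(n - 1)."""
    ordered = sorted(values)
    n = len(ordered)
    location = 1 + (p / 100) * (n - 1)   # 1-based position L_P
    lower_pos = math.floor(location)     # L_P rounded down
    upper_pos = math.ceil(location)      # next position up (same if L_P is whole)
    low = ordered[lower_pos - 1]         # l: value at the lower position
    high = ordered[upper_pos - 1]        # h: value at the upper position
    return low + (location - lower_pos) * (high - low)

data = [10, 20, 20, 40]                  # hypothetical observations

q1 = percentile(data, 25)                # L_P = 1.75 -> 10 + 0.75 * (20 - 10)
q3 = percentile(data, 75)                # L_P = 3.25 -> 20 + 0.25 * (40 - 20)
iqr = q3 - q1                            # 3rd quartile - 1st quartile
data_range = max(data) - min(data)       # max - min

print(mean(data), median(data), mode(data))
print(q1, q3, iqr, data_range)
```

With this definition the 50th percentile coincides with the median, which is a quick sanity check on the position formula.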
Sample var: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Sample sd: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Lookup: know what and where
Browse: don't know what, know where
Locate: know what, not where
Explore: don't know what nor where

Distance clustering:
Distances determine clusters
Different attribute scales can be chosen, influencing distance
Basis of quality measure W(C)
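A small sketch of why the choice of attribute scale matters for distance-based clustering. The three people and the helper `nearest` are hypothetical; the only change between the two runs is whether height is stored in metres or centimetres, and straight-line (Euclidean) distance is assumed.

```python
import math

# Hypothetical people described by (height in metres, weight in kg).
people_m = {"A": (1.80, 70.0), "B": (1.60, 71.0), "C": (1.79, 75.0)}

# Same people, but height rescaled to centimetres.
people_cm = {name: (h * 100, w) for name, (h, w) in people_m.items()}

def nearest(points, target):
    """Name of the point closest to `target` by straight-line distance."""
    others = (name for name in points if name != target)
    return min(others, key=lambda name: math.dist(points[target], points[name]))

print(nearest(people_m, "A"))   # B: weight dominates when height is in metres
print(nearest(people_cm, "A"))  # C: height dominates when height is in centimetres
```

Because the nearest neighbour changes with the unit, the resulting clusters would change too, which is the reason for standardizing dissimilar attributes.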
Distance: measure for how close things are, how related things are; distances can be easily compared; no single appropriate distance
Euclidean distance: as the crow flies
Network distance: know network of possible movements, network is sparse (not too many possible roads)
Manhattan distance: movement is restricted to fixed grid

Key attribute = independent attribute
Value attribute = dependent attribute

Scatterplot: 2 quant. att., no keys only values, points, horiz. + vert. position, find trends, outliers, distribution, correlation, clusters
Bar chart: 1 cat. att. (key) + 1 quant. att. (value), lines, length to express quant. value, spatial regions: one per mark, compare + look up values

MAD: median of absolute deviations from the median
Sample covariance and correlation: relation
$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ and $r_{xy} = \frac{s_{xy}}{s_x s_y}$
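The MAD, sample covariance and correlation definitions above, together with the Euclidean and Manhattan distances, can be sketched as follows. All function names and the data are illustrative assumptions; `stdev` from the standard library is the sample standard deviation (divisor n - 1).

```python
import math
from statistics import mean, median, stdev

def mad(xs):
    """Median of absolute deviations from the median."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

def sample_covariance(xs, ys):
    """s_xy = 1/(n-1) * sum((x_i - x_bar) * (y_i - y_bar))."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y)."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

def euclidean(p, q):
    """Straight-line distance ('as the crow flies')."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Distance when movement is restricted to a fixed grid."""
    return sum(abs(a - b) for a, b in zip(p, q))

xs = [1, 2, 3, 4]          # hypothetical data
ys = [2, 4, 6, 8]          # exactly y = 2x, so the correlation should be 1

print(mad(xs), sample_covariance(xs, ys), sample_correlation(xs, ys))
print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
```

The two distances disagree on the same pair of points (5 versus 7 here), which illustrates that there is no single appropriate distance.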
Stacked bar chart: 2 cat. att., 1 quant. att., vertical stack of line marks, glyph: composite object, internal structure from multiple marks, length and color hue, spatial regions: one per glyph, compare + look up values, part-to-whole relationship
Normalized stacked bar chart: same as stacked bar chart, reduces comparability for all cat. except lowest and highest
Line chart: 2 quant. att., 1 key, 1 value, points, aligned lengths to express quant. val., separated + ordered by key att. into horizontal regions, find trend, connecting line emphasizes ordering of items along key axis by showing relationship between two items
Heatmap: 2 cat. att., 1 quant. att., area, separate + align in 2D matrix, indexed by 2 cat. values, color by quant. att., find clusters + outliers

Categories data mining:
Predefined target?
Yes -> supervised method
No -> unsupervised method
Info applicable to all or some data?
All -> global method
Some -> local method

DM methods:
Lin regression: supervised, global
Clustering: unsupervised, global
Decision tree: supervised, global
Association rule learning: unsupervised, local

Linear regression:
Consider residual $y - \hat{y}$ between real value $y$ and predicted value $\hat{y} = b_0 + b_1 x$.
$\mathrm{SSD} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Lower SSD -> better model

Decision tree:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Where to split: all-yes or all-no most informative, equal yes/no least informative; lowest avg entropy:
$H(p) = -p \log_2(p) - (1-p)\log_2(1-p)$

Association rule learning:
Support of itemset X: $\mathrm{supp}(X) = \frac{|X|}{n}$
Support of itemset X ∩ Y: $\mathrm{supp}(X \cap Y) = \frac{|X \cap Y|}{n}$
Confidence of rule X => Y: $\mathrm{conf}(X \Rightarrow Y) = \frac{|X \cap Y|}{|X|}$
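The decision tree and association rule measures above can be sketched in a few lines. Everything here is an illustrative assumption: the function names, the tiny transaction list, and the choice to weight each branch by its size when averaging entropy. |X| is read as the number of transactions containing X, and |X ∩ Y| as the number containing both X and Y.

```python
import math

def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def entropy(p):
    """H(p) = -p log2(p) - (1 - p) log2(1 - p), taking H(0) = H(1) = 0."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def average_entropy(branches):
    """Size-weighted average entropy of a split.

    `branches` is a list of (yes_count, no_count) pairs, one per branch;
    every branch is assumed to be non-empty.
    """
    total = sum(yes + no for yes, no in branches)
    return sum((yes + no) / total * entropy(yes / (yes + no)) for yes, no in branches)

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """conf(X => Y): transactions with both X and Y, relative to those with X."""
    both = set(x) | set(y)   # a transaction must contain the items of X and of Y
    return support(transactions, both) / support(transactions, x)

# Hypothetical splits: the first separates yes/no perfectly, the second not at all.
print(average_entropy([(4, 0), (0, 4)]), average_entropy([(2, 2), (2, 2)]))

# Hypothetical market baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(support(baskets, {"bread"}), confidence(baskets, {"bread"}, {"milk"}))
```

The perfectly separating split has average entropy 0 and the 50/50 split has average entropy 1, matching the rule that all-yes or all-no branches are most informative.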
Linear regression: best values of b0 and b1 are those minimizing SSD
Higher R² -> better model

Histogram: table, find distribution (shape), new table: keys are bins, values are counts, bin size crucial, related to kernel density estimate and rug plot
Boxplot: table, find distribution (group comparison), 5 quant. att., median: central line, lower + upper quartiles: boxes, lower + upper fences: whiskers, first quartile - 1.5 IQR, third quartile + 1.5 IQR, outliers beyond fence shown
Violin plot: same as boxplot, outliers are represented in density plot
Bar vs line chart: depends on type of key att.: bar if key = cat. (nominal), line if key = ordered, never line for categorical key: violates expressiveness principle + trend so strong it overrides semantics
Box vs violin plot: boxplots hide essential aspects of dataset, violin plots better for representing differences in distribution of data

Object system: 'real' world of a company, organization…
Information system: representation of real world in a computer system using data to represent objects
Not storing all data in one table: duplication of information, difficulty keeping information consistent, difficulty accessing + sharing data, hard to keep data safe/secure, hard to express interesting analytics
Database management systems (DBMSs) provide solutions: data redundancy + inconsistency, data security, efficient data analytics
Primary key = unique identifier
Logical schema (data model): logical structure of database

Clustering:
Centroids represent clusters.
K-means clustering algorithm:
1. Pick k points as centroids
2. Assign points to nearest centroid
3. Recompute centroids: mean of points in cluster
4. Repeat steps 2 and 3
How well does centroid represent cluster: small distance -> good, large distance -> bad
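The k-means steps in the clustering notes (pick k centroids, assign points, recompute centroids, repeat) can be sketched as below. This is a minimal illustration, not a reference implementation: the function names, the stopping rule (stop when centroids no longer change), the handling of empty clusters and the sample points are all assumptions, and Euclidean distance is used throughout.

```python
import math
from statistics import mean

def k_means(points, initial_centroids, max_iter=100):
    """Plain k-means; step 1 (picking k centroids) is left to the caller."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        # Assumption: an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(mean(dim) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: repeat until the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def within_cluster_distance(centroids, clusters):
    """W(C): sum of distances between each centroid and its observations."""
    return sum(math.dist(c, p) for c, cluster in zip(centroids, clusters) for p in cluster)

# Hypothetical 2-D observations forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, initial_centroids=[(1, 1), (8, 8)])

print(centroids)
print(clusters)
print(within_cluster_distance(centroids, clusters))
```

A small W(C) means the centroids represent their clusters well; rerunning with different initial centroids or a different k is a simple way to see how the result depends on those choices.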