Lecture 1 – Introduction
Knowledge Discovery in Databases (KDD)
➢ The process of (semi-)automatic extraction of knowledge from databases, i.e. discovering useful knowledge from a collection of data, where the knowledge is
o Valid
o Previously unknown
o Potentially useful
➢ Interdisciplinary field:
o Database systems
▪ Scalability for large datasets – integration from different sources – novel data
types (text)
o Statistics
▪ Probabilistic knowledge – model-based inferences – evaluation of knowledge
o Machine learning
▪ Different paradigms of learning – supervised learning – hypothesis spaces
and search strategies
➢ KDD Process Model
Visual Analytics
➢ Data → visualization → gain insights
➢ Importance of visualization: make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.
➢ Goals of visualization:
o Presentation
▪ Starting point: facts to be presented are fixed a priori
▪ Process: choice of appropriate presentation techniques
▪ Result: high-quality visualization of the data to present facts
o Confirmatory analysis
▪ Starting point: hypotheses about the data
▪ Process: goal-oriented examination of the hypotheses
▪ Result: visualization of data to confirm or reject the hypotheses
o Exploratory analysis
▪ Starting point: no hypotheses about the data
▪ Process: interactive, usually undirected search for structures, trends
▪ Result: visualization of data to lead to hypotheses about the data
➢ Visualization: the process of presenting data in a
form that allows rapid understanding of
relationships and findings that are not readily
evident from raw data
➢ Two ways of going through the conceptual pipeline (data → visualization OR data → models)
➢ Sense making loop → not a one-way street, but a loop => knowledge generation loop
Lecture 2 – Data Foundations 1
Types of data
➢ Data can be gathered/ generated from many sources. Independent of the source, each data
point has a data type
o Nominal & ordinal => categorical or discrete values
o Numeric => continuous scale
➢ Nominal
o Discrete; values differ, but there is no specific ranking → classification without order (ID, gender)
o No quantitative relationship between categories
➢ Ordinal
o comparison is possible (e.g. noise: one sound is louder than another); differences in values have a rank order → attributes can be rank-ordered
o distances can be arbitrary (e.g. smoking habits) → distances between values do not have any meaning
➢ Numeric
o attributes can be rank-ordered (e.g. differences in height)
o distances between values have a meaning
o calculations with the data are possible! (height of X = height of Y + 5/2); meaningful distances between values where mathematical operations are possible (age, time); see the sketch below
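A minimal sketch of the three scales, assuming Python with pandas (the column names and values here are hypothetical), showing which operations each scale supports:

```python
import pandas as pd

# Hypothetical records illustrating the three measurement scales.
df = pd.DataFrame({
    "gender": ["m", "f", "f"],                   # nominal: categories without order
    "smoking": ["never", "sometimes", "often"],  # ordinal: ranked, arbitrary distances
    "height_cm": [172.0, 168.5, 181.2],          # numeric: distances are meaningful
})

# Nominal: unordered categorical -> only equality comparisons make sense.
df["gender"] = df["gender"].astype("category")

# Ordinal: ordered categorical -> rank comparisons make sense, arithmetic does not.
df["smoking"] = pd.Categorical(
    df["smoking"], categories=["never", "sometimes", "often"], ordered=True
)
print(df["smoking"] > "never")  # valid, because a rank order exists

# Numeric: full arithmetic is meaningful.
print(df["height_cm"].mean(), df["height_cm"].max() - df["height_cm"].min())
```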
Typical data classes
➢ Scalar: an individual number in a data record
➢ Multivariate and multidimensional data: multiple variables within a single record can represent a composite data item; it is not always easy to compute differences (e.g. comparing gender and weight)
➢ Vector: it is common to treat the vector as a whole (e.g. a telephone number, which can be divided into a country/region code)
➢ Network data: vertices on a surface are connected to their neighbors via edges
➢ Hierarchical data: relationships between nodes in a hierarchy can be specified by links
➢ Time-series data: a complex kind of data to look into; time has the widest range of possible values
➢ Example: ducks
o Ordinal: gender
o Numeric: amount
o Vector: location
o Network: parent/child
o Hierarchical: leader/follower
o Time series: movement
Data preprocessing: Data cleaning
➢ Rubbish in, rubbish out: you have to make sure you do data cleaning (e.g. treat missing values) → low-quality data will lead to low-quality mining results
➢ Data cleaning → missing values (a short sketch of these strategies follows this list):
o ignore the tuple (delete the whole row)
▪ + easily done, no computational effort
▪ - loss of information, unnecessary if the attribute is not needed
o fill in the missing value manually
▪ + effective for small datasets
▪ - need to know the value, time consuming, not feasible with large datasets
o use a global constant (e.g. -1 as a flag so the algorithm does not use this value in calculations)
▪ + can be easily done, perhaps interesting to know the missing value
o use the attribute mean
▪ + simple to implement
▪ - not the most accurate approximation of the value
o use the most probable value
▪ + most accurate approximation of the value
▪ - most computational effort
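A minimal pandas sketch of the strategies above (the column names and values are hypothetical); the "most probable value" option is only commented, since it needs a predictive model:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, np.nan, 31, np.nan],
                   "income": [40, 52, np.nan, 48]})

dropped  = df.dropna()    # ignore the tuple: delete every row with a missing value
constant = df.fillna(-1)  # global constant: a flag value the algorithm must skip
by_mean  = df.fillna(df.mean(numeric_only=True))  # attribute mean: simple, crude
# Most probable value: train a model on the other attributes (e.g. a
# regression) and predict the missing entries -- most accurate, most costly.
```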
➢ Data cleaning → noisy data: a random error or variance in a measured variable
o Smooth out the noise!
o Systematic error: the sensor always senses a little too high → the frequency curve keeps its shape but is shifted in one direction (unlike random noise, which leaves the average the same)
o Handling noisy data:
▪ Binning: sort the data and partition it into (equi-depth) bins, then smooth by bin means, bin medians, bin boundaries, etc. (see the sketch after this list)
• Equal-width binning:
o Divides the range into N intervals of equal size
o Width of the intervals: width = (max - min) / N
o Simple
o Outliers may dominate result
• Equal-depth binning:
o Divides the range into N intervals
o Each interval contains approximately the same
number of records
o Skewed data is also handled well
▪ Regression: smooth out noise by fitting a regression function (see the sketch after this list)
o Assume our data can be modelled ‘easily’
o Global linear regression models may not be adequate for
“nonlinear” data
o The regression model can be static or dynamic
▪ Static: using only the historical data to calculate the
function
▪ Dynamic: also use new data to adapt the model
• Linear regression
o Tries to discover the parameters of the straight-line equation that best fits the data points → the line that minimises the squared error over all data points
o (Example from the slides: a fitted slope of b = 0.6857 is very slight.)
• Non-linear regression (slides)
▪ Clustering: cluster data and remove outliers
(automatically or via human inspection)
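A minimal sketch of both binning variants and of smoothing by bin means, assuming Python with pandas (`cut` for equal-width, `qcut` for equal-depth); the values are hypothetical:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3  # number of bins

# Equal-width: N intervals of size (max - min) / N; outliers may dominate.
equal_width = pd.cut(values, bins=N)

# Equal-depth: roughly the same number of records per bin; handles skew well.
equal_depth = pd.qcut(values, q=N)

# Smoothing by bin means: replace every value by the mean of its bin.
smoothed = values.groupby(equal_depth, observed=True).transform("mean")
print(smoothed.tolist())
```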
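And a minimal sketch of fitting the least-squares straight line y = a + b·x directly from its closed form (the measurements are hypothetical):

```python
import numpy as np

# Hypothetical noisy measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.6, 3.5, 3.9, 4.8])

# Closed-form least squares:
# b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)
print(a, b, np.sum(residuals ** 2))  # the line minimising the squared error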
Lecture 3 – Data Foundations 2
The best regression line lies closest to the points: it may almost touch some points, but the distance to any point should never be huge (that would indicate an outlier).
Continuing with data preprocessing: data cleaning has been discussed, now:
Data Preprocessing: Normalisation
➢ Linear normalization (see the sketch after this list)
➢ Square root normalization (take the square root of everything)
➢ Logarithmic normalization (take ln() of everything)
➢ Possible solutions for data streams (problem when adding data to the table, e.g. new min/max values):
o rerun the normalization
▪ + overall correct data representation
▪ - computationally expensive
▪ - perception of previous results distorted
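A minimal sketch of the three normalizations and of the data-stream problem (the values are hypothetical):

```python
import numpy as np

x = np.array([3.0, 10.0, 50.0, 400.0])

# Linear (min-max) normalization to [0, 1].
linear = (x - x.min()) / (x.max() - x.min())

# Square-root normalization: take the square root of everything.
sqrt_norm = np.sqrt(x)

# Logarithmic normalization: take ln() of everything.
log_norm = np.log(x)

# Stream problem: a new value outside the old [min, max] invalidates
# `linear`; either rerun the normalization over all data (correct but
# computationally expensive) or keep the old min/max (cheap, but the
# representation of new values is distorted).
print(linear, sqrt_norm, log_norm)
```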