Data mining lecture notes
Lecture 1
What is (and is not) data mining? Any method that distills [actionable] information or knowledge
from data.
NOT
- Look up phone number in phone directory
- Query a web search engine for information about Amazon
IS
- Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in
the Boston area)
- Group together similar documents returned by search engine according to their context (e.g.
Amazon rainforest vs. Amazon.com)
Data mining tasks
- Prediction methods
o Use some variables to predict unknown or future values of other variables
o Classification
Given a collection of records (training set): each record contains a set of
attributes, one of which is the class
Find a model for the class attribute as a function of the values of the other attributes
Goal: previously unseen records should be assigned a class as accurately as
possible
A test set is used to determine the accuracy of the model
o Regression
Predict the value of a given continuous-valued variable based on the values of
other variables, assuming a linear or nonlinear model of dependency
o Time series analysis
o Prediction
- Description methods
o Find human-interpretable patterns that describe the data
o Clustering
Given a set of data points, each having the same set of attributes, and a
similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another
Data points in separate clusters are less similar to one another
Similarity measures:
Euclidean distance if attributes are continuous
Other problem-specific measures
o Summarization
o Association rules
Given a set of records, each of which contains some number of items from a
given collection, produce dependency rules that predict the occurrence of
an item based on the occurrences of other items
o Sequence discovery
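The classification workflow described above (learn a model from a training set, then measure its accuracy on a test set of previously unseen records) can be sketched in a few lines of Python. This is only an illustration: the toy records, the choice of a 1-nearest-neighbour model with Euclidean distance, and all function names are assumptions made for the example, not something prescribed by the lecture.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(training_set, attributes):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    nearest = min(training_set, key=lambda rec: euclidean(rec[0], attributes))
    return nearest[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attrs, label in test_set
                  if predict_1nn(training_set, attrs) == label)
    return correct / len(test_set)

# Hypothetical records in the form ((attribute values), class).
training_set = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
                ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
test_set = [((0.9, 1.1), "A"), ((5.1, 4.9), "B"), ((3.5, 3.5), "A")]

print(accuracy(training_set, test_set))
```

In this toy data the third test record lies closer to the "B" training records than to the "A" ones, so it is misclassified and the measured accuracy is 2/3. The test set is kept separate from the training set so that the accuracy reflects performance on unseen records.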
Components of data mining algorithms
- Representation: determining the nature and structure of the representation to be used
- Score function: quantifying and comparing how well different representations fit the data
- Search/optimization method: choosing an algorithmic process to optimize the score function
- Data management: deciding what principles of data management are required to implement
the algorithms efficiently
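To make the first three components concrete, here is a minimal sketch in Python. The data, the choice of a line through the origin as the representation, the sum of squared errors as the score function, and a grid search as the search method are all assumptions chosen for illustration. The data management component is not shown, because the toy data fits in memory.

```python
# Hypothetical data: (x, y) pairs that roughly follow y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

def model(slope, x):
    """Representation: a line through the origin, y = slope * x."""
    return slope * x

def score(slope, data):
    """Score function: sum of squared errors (lower is better)."""
    return sum((y - model(slope, x)) ** 2 for x, y in data)

def grid_search(data, candidates):
    """Search method: try every candidate slope and keep the best-scoring one."""
    return min(candidates, key=lambda s: score(s, data))

candidates = [i / 10 for i in range(0, 41)]  # 0.0, 0.1, ..., 4.0
best_slope = grid_search(data, candidates)
print(best_slope, score(best_slope, data))
```

Swapping any one component (a different model family, a different error measure, or a smarter optimizer than grid search) gives a different algorithm while the overall structure stays the same.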
Challenges of data mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy
- Accountability
- Fairness
Lecture 2
Data is a collection of objects and their attributes. An attribute is a property or
characteristic of an object. A collection of attributes describes an object.
Attribute values are numbers or symbols assigned to an attribute. Distinction
between attributes and attribute values:
- Same attribute can be mapped to different attribute values
- Different attributes can be mapped to the same set of values
The type of an attribute depends on which of the following properties its values possess: distinctness, order, addition, multiplication
- Nominal attribute: distinctness
- Ordinal attribute: distinctness & order
- Interval attribute: distinctness, order & addition
- Ratio attribute: all 4 properties
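The hierarchy above can be written down as a small lookup, which makes it easy to check whether an operation is meaningful for a given attribute. This is a sketch only: the names PROPERTIES, LEVEL, properties_of and supports are hypothetical, the fourth property is taken to be multiplication, and the example attributes are illustrative choices.

```python
# Properties in the order in which they accumulate across attribute types.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]

# Number of properties each attribute type possesses.
LEVEL = {"nominal": 1, "ordinal": 2, "interval": 3, "ratio": 4}

def properties_of(attribute_type):
    """Return the properties that are meaningful for the given attribute type."""
    return PROPERTIES[:LEVEL[attribute_type]]

def supports(attribute_type, prop):
    """True if the property is meaningful for the attribute type."""
    return prop in properties_of(attribute_type)

# Illustrative attributes (assumed examples, not taken from the notes).
examples = {
    "eye colour": "nominal",
    "grade (A-F)": "ordinal",
    "temperature in Celsius": "interval",
    "length in metres": "ratio",
}
for name, kind in examples.items():
    print(name, kind, properties_of(kind))
```

For instance, supports("interval", "multiplication") is False: a Celsius temperature of 20 is not "twice as hot" as 10, whereas a length of 20 metres is twice a length of 10 metres.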
A discrete attribute has only a finite or countably infinite set of values; it is often represented as an
integer variable. A continuous attribute has real numbers as attribute values; it is typically represented
as a floating-point variable.
Types of data sets
- Record
o Data that consist of a collection of records, each of which consists of a fixed set of
attributes
- Data matrix
o If data objects have the same fixed set of numeric attributes, then the data objects
can be thought of as points in a multi-dimensional space, where each dimension
represents a distinct attribute
o Such a data set can be represented by an m by n matrix, where there are m rows, one
for each object, and n columns, one for each attribute
- Document data
o Each document becomes a “term” vector
Each term is a component [attribute] of the vector
The value of each component is the number of times the corresponding term
occurs in the document
- Transaction data
o Special type of record data where each record [transaction] involves a set of items
- Graph
o World Wide Web
o Molecular structures
- Ordered
o Spatial data
o Temporal data
o Sequential data
o Genetic sequence data
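As an illustration of the document data type above, the following Python sketch turns a few documents into term vectors, where each component counts how often a term occurs. The mini-corpus, the whitespace tokenisation, and the function name term_vector are assumptions made for the example.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical mini-corpus.
documents = ["the team won the game", "the game was lost", "team play wins"]

# One attribute (column) per distinct term, in alphabetical order.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

# One row per document: the result is an m by n data matrix.
matrix = [term_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for row in matrix:
    print(row)
```

Stacking the term vectors gives a data matrix with one row per document and one column per term, so document data can be handled with the same tools as any other numeric data matrix.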
Data quality
Data are of high quality if they
- Are fit for their intended use
- Correctly represent the phenomena they correspond to
Data quality problems
- Noise
o Modification of original values
- Outliers
o Data objects with characteristics that are considerably different from most [or even
any?] of the other data objects in the data set
- Missing values
o Reasons
Information is not collected
Attributes may not be applicable to all cases
o Handling missing values
Eliminate data objects
Estimate missing values
Ignore the missing value during analysis
Replace with all possible values [weighted by their probabilities]
- Duplicate data
o Data set may include data objects that are duplicates [or almost duplicates] of one
another
o Results in a need for data cleaning
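Three of the strategies for handling missing values listed above can be sketched in Python as follows. The records, the use of None as the missing-value marker, the function names, and the choice of the mean as the estimate are all assumptions for illustration. Other estimates (median, model-based prediction) are equally possible.

```python
from statistics import mean

# Hypothetical records; None marks a missing value for "age".
records = [{"id": 1, "age": 30}, {"id": 2, "age": None}, {"id": 3, "age": 50}]

def eliminate(records, attribute):
    """Eliminate data objects whose value for the attribute is missing."""
    return [r for r in records if r[attribute] is not None]

def estimate(records, attribute):
    """Estimate each missing value as the mean of the observed values."""
    fill = mean(r[attribute] for r in records if r[attribute] is not None)
    return [dict(r, **{attribute: fill}) if r[attribute] is None else dict(r)
            for r in records]

def mean_ignoring_missing(records, attribute):
    """Ignore missing values during the analysis itself (here: a mean)."""
    return mean(r[attribute] for r in records if r[attribute] is not None)

print(eliminate(records, "age"))
print(estimate(records, "age"))
print(mean_ignoring_missing(records, "age"))
```

Eliminating objects is the simplest option but discards the other attributes of those objects. Estimation keeps every object at the cost of introducing values that were never observed.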
Data preprocessing
- Aggregation
o Combining multiple attributes [or objects] into a single attribute [or object]
o Purpose
Data reduction
Change of scale
More “stable” data
- Sampling
o Main technique employed for data selection (used for both preliminary investigation
of the data and the final analysis)
o Statisticians sample because obtaining all data of interest is too expensive or time
consuming
o Sampling is used in data mining because processing all data of interest is too
expensive or time consuming
o Key principle
Using a sample will work almost as well as using the entire data set, if the
sample is representative
A sample is representative if it has approximately the same property [of
interest] as the original data
o Types of sampling
Simple random sampling: equal probability of selecting any particular item
Sampling without replacement: as each item is selected, it is removed from
the population
Sampling with replacement: objects are not removed from the population as
they are selected for the sample (the same object can be picked more than
once)
Stratified sampling: split the data into several partitions; then draw random
samples from each partition (the partitions do not have to be the same size)
- Dimensionality reduction
o Purpose
Avoid curse of dimensionality: when dimensionality increases, data becomes
increasingly sparse in the space that it occupies
Reduce amount of time and memory required by data mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise
o Techniques
Principal Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques
- Feature subset selection
o Brute-force approach: try all possible feature subsets as input to the data mining
algorithm
o Embedded approaches: feature selection occurs naturally as part of the data mining
algorithm
o Filter approaches: features are selected before the data mining algorithm is run
o Wrapper approaches: use the data mining algorithm as a black box to find the best
subset of attributes
- Feature creation
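The sampling types listed under Sampling above can be sketched with Python's standard random module. The population, the fixed seed, the helper stratified_sample, and the even/odd strata are assumptions made for the example. Drawing the same number of items from every stratum is only one possible design.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = list(range(100))

# Sampling without replacement: a selected item cannot be picked again.
without_replacement = rng.sample(population, 10)

# Sampling with replacement: the same item can be picked more than once.
with_replacement = rng.choices(population, k=10)

def stratified_sample(items, key, per_stratum, rng):
    """Partition items by key, then draw a random sample from each partition."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = min(per_stratum, len(members))  # a partition may be small
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical strata: even versus odd numbers.
stratified = stratified_sample(population, lambda x: x % 2, 5, rng)

print(without_replacement)
print(with_replacement)
print(stratified)
```

Stratified sampling guarantees that every partition is represented in the sample, which simple random sampling cannot promise when some partitions are rare.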