Data mining - all lectures

All data mining lectures summarized.

  • October 19, 2024
  • 50
  • 2023/2024
  • Class notes
  • Marco Loog & Tom Klaassen
  • All classes

Data mining lecture notes
Lecture 1
What is/ is not data mining? Any method that distills [actionable] information/knowledge from
data?

NOT

- Look up phone number in phone directory
- Query a web search engine for information about Amazon

IS

- Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in
the Boston area)
- Group together similar documents returned by search engine according to their context (e.g.
Amazon rainforest vs. Amazon.com)

Data mining tasks

- Prediction methods
o Use some variables to predict unknown or future values of other variables
o Classification
 Given a collection of records (training set): each record contains a set of
attributes, one of the attributes is the class
 Find a model for class attribute as a function of the values of other attributes
 Goal: previously unseen records should be assigned a class as accurately as
possible
 A test set is used to determine the accuracy of the model
o Regression
 Predict a value of a given continuous valued variable based on the values of
other variables, assuming a linear or nonlinear model of dependency
o Time series analysis
o Prediction
- Description methods
o Find human-interpretable patterns that describe the data
o Clustering
 Given a set of data points, each having the same set of attributes, and a
similarity measure among them, find clusters such that
 Data points in one cluster are more similar to one another
 Data points in separate clusters are less similar to one another
 Similarity measures:
 Euclidean distance if attributes are continuous
 Other problem-specific measures
o Summarization
o Association rules
 Given a set of records each of which contain some number of items from a
given collection: produce dependency rules which will predict occurrence of
an item based on occurrences of other items
o Sequence discovery
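
The classification workflow above (training set, model, test set, accuracy) can be sketched in a few lines. This is only an illustration, not a method from the lectures: the 1-nearest-neighbour rule is a swapped-in classifier chosen for brevity, the records are made up, and the function names are hypothetical. It also shows Euclidean distance as a similarity measure for continuous attributes.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(training_set, attributes):
    """Assign the class of the training record closest to `attributes`."""
    nearest = min(training_set, key=lambda record: euclidean(record[0], attributes))
    return nearest[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(predict_1nn(training_set, attrs) == label
                  for attrs, label in test_set)
    return correct / len(test_set)

# Made-up toy records: (attributes, class). The class is one of the attributes
# of each record, stored separately here for convenience.
training_set = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
                ((8.0, 8.0), "B"), ((9.0, 7.5), "B")]
# Previously unseen records, used only to measure accuracy.
test_set = [((2.0, 1.5), "A"), ((8.5, 9.0), "B"), ((5.5, 5.0), "A")]

test_accuracy = accuracy(training_set, test_set)
print(f"accuracy on the test set: {test_accuracy:.2f}")
```

The third test record lies closer to the "B" training records, so it is misclassified and the accuracy on this toy test set is 2 out of 3.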

Components of data mining algorithms
- Representation: determining the nature and structure of the representation to be used
- Score function: quantifying and comparing how well different representations fit the data
- Search/optimization method: choosing an algorithmic process to optimize the score function
- Data management: deciding what principles of data management are required to implement
the algorithms efficiently
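
As a rough illustration of how the first three components fit together, the sketch below assumes a one-parameter linear model as the representation, the sum of squared errors as the score function, and an exhaustive search over a grid of candidate values as the optimization method. All of these choices, the data, and the function names are assumptions made for the example; data management is not covered.

```python
def predict(w, x):
    # Representation: a one-parameter linear model y = w * x.
    return w * x

def score(w, data):
    # Score function: sum of squared errors between model and data (lower is better).
    return sum((y - predict(w, x)) ** 2 for x, y in data)

def search(data, candidates):
    # Search/optimization: try every candidate parameter and keep the best-scoring one.
    return min(candidates, key=lambda w: score(w, data))

# Made-up (x, y) pairs that roughly follow y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
candidates = [i / 10 for i in range(0, 41)]  # 0.0, 0.1, ..., 4.0

best_w = search(data, candidates)
print(best_w, score(best_w, data))
```

Swapping any one component (for example a different score function) leaves the other two unchanged, which is the point of separating them.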

Challenges of data mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy
- Accountability
- Fairness

Lecture 2
Data is a collection of objects and their attributes. An attribute is a property or
characteristic of an object. A collection of attributes describes an object.
Attribute values are numbers or symbols assigned to an attribute. Distinction
between attributes and attribute values:

- The same attribute can be mapped to different attribute values (e.g. height can be measured in feet or in meters)
- Different attributes can be mapped to the same set of values (e.g. ID and age can both be integers)

Mathematical properties/ operations:

- Distinctness: =, ≠
- Order: <, >
- Addition: +, -
- Multiplication: *, /

The type of an attribute depends on which of these apply:

- Nominal attribute: distinctness
- Ordinal attribute: distinctness & order
- Interval attribute: distinctness, order & addition
- Ratio attribute: all 4 properties
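
The nesting of the four attribute types can be written down as a small lookup table. The sketch below is only an illustration of the list above; the names `PROPERTIES`, `ATTRIBUTE_TYPES` and `supports` are hypothetical.

```python
# Properties in the order the notes list them.
PROPERTIES = ("distinctness", "order", "addition", "multiplication")

# Each attribute type supports a prefix of that list.
ATTRIBUTE_TYPES = {
    "nominal": PROPERTIES[:1],
    "ordinal": PROPERTIES[:2],
    "interval": PROPERTIES[:3],
    "ratio": PROPERTIES[:4],
}

def supports(attribute_type, prop):
    """Return True if values of this attribute type support the given property."""
    return prop in ATTRIBUTE_TYPES[attribute_type]

for name, props in ATTRIBUTE_TYPES.items():
    print(f"{name}: {', '.join(props)}")
```

For instance, `supports("interval", "addition")` is true while `supports("interval", "multiplication")` is false: differences between interval values are meaningful, ratios are not.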

A discrete attribute has only a finite or countably infinite set of values and is often represented as an integer
variable. A continuous attribute has real numbers as attribute values and is typically represented as a
floating-point variable.

Types of data sets
- Record
o Data that consist of a collection of records, each of which consists of a fixed set of
attributes
- Data matrix
o If data objects have the same fixed set of numeric attributes, then the data objects
can be thought of as points in a multi-dimensional space, where each dimension
represents a distinct attribute
o Such a data set can be represented by an m by n matrix, where there are m rows, one
for each object, and n columns, one for each attribute
- Document data
o Each document becomes a “term” vector
 Each term is a component [attribute] of the vector
 The value of each component is the number of times the corresponding term
occurs in the document
- Transaction data
o Special type of record data where each record [transaction] involves a set of items
- Graph
o World Wide Web
o Molecular structures
- Ordered
o Spatial data
o Temporal data
o Sequential data
o Genetic sequence data
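
To make the document data and data matrix items concrete, the sketch below turns a few made-up documents into term vectors and stacks them into an m by n matrix (one row per document, one column per term). Splitting on whitespace and lowercasing are simplifying assumptions, and `term_vectors` is a hypothetical helper name.

```python
from collections import Counter

def term_vectors(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # One column (attribute) per distinct term, in alphabetical order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

documents = ["data mining finds patterns in data", "mining patterns"]
vocabulary, matrix = term_vectors(documents)

print(vocabulary)
for row in matrix:
    print(row)
```

Here m = 2 (documents) and n = 5 (distinct terms), so each document is a point in a 5-dimensional space.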

Data quality
Data are of high quality if they

- Are fit for their intended use
- Correctly represent the phenomena they correspond to

Data quality problems

- Noise
o Modification of original values
- Outliers
o Data objects with characteristics that are considerably different from most [or even
any?] of the other data objects in the data set
- Missing values
o Reasons
 Information is not collected
 Attributes may not be applicable to all cases
o Handling missing values
 Eliminate data objects
 Estimate missing values
 Ignore the missing value during analysis
 Replace with all possible values [weighted by their probabilities]
- Duplicate data
o Data set may include data objects that are duplicates [or almost duplicates] of one
another
o Results in a need for data cleaning
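
Two of the strategies for missing values (eliminating data objects and estimating the missing value) and a simple form of duplicate removal can be sketched as follows. The records are made up, `None` is assumed to mark a missing value, estimation by the column mean is just one possible estimator, and only exact duplicates are handled; near-duplicates need a problem-specific similarity check.

```python
def drop_incomplete(records):
    """Eliminate data objects that have at least one missing value."""
    return [r for r in records if None not in r]

def impute_mean(records, column):
    """Estimate missing values in `column` by the mean of the observed values."""
    observed = [r[column] for r in records if r[column] is not None]
    mean = sum(observed) / len(observed)
    return [
        tuple(mean if (i == column and value is None) else value
              for i, value in enumerate(r))
        for r in records
    ]

def drop_exact_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for r in records:
        if r not in seen:
            seen.add(r)
            unique.append(r)
    return unique

# Made-up records: (id, measurement); None marks a missing measurement.
records = [(1, 10.0), (2, None), (3, 30.0), (3, 30.0), (4, 50.0)]

print(drop_incomplete(records))
print(impute_mean(records, column=1))
print(drop_exact_duplicates(records))
```

Elimination loses record 2 entirely, while imputation keeps it at the cost of inventing a value, which is the trade-off behind the list of options above.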

Data preprocessing

- Aggregation
o Combining multiple attributes [or objects] into a single attribute [or object]
o Purpose
 Data reduction
 Change of scale
 More “stable” data
- Sampling
o Main technique employed for data selection (used for both preliminary investigation
of the data and the final analysis)
o Statisticians sample because obtaining all data of interest is too expensive or time
consuming
o Sampling is used in data mining because processing all data of interest is too
expensive or time consuming
o Key principle
 Using a sample will work almost as well as using the entire data set, if the
sample is representative
 A sample is representative if it has approximately the same property [of
interest] as the original data
o Types of sampling
 Simple random sampling: equal probability of selecting any particular item
 Sampling without replacement: as each item is selected, it is removed from
the population
 Sampling with replacement: objects are not removed from the population as
they are selected for the sample (the same object can be picked more than
once)
 Stratified sampling: split the data into several partitions; then draw random
samples from each partition (the partitions do not have to be the same size)
- Dimensionality reduction
o Purpose
 Avoid curse of dimensionality: when dimensionality increases, data becomes
increasingly sparse in the space that it occupies
 Reduce amount of time and memory required by data mining algorithms
 Allow data to be more easily visualized
 May help to eliminate irrelevant features or reduce noise
o Techniques
 Principal Component Analysis
 Singular Value Decomposition
 Others: supervised and non-linear techniques
- Feature subset selection
o Brute-force approach: try all possible feature subsets as input to data mining
algorithm
o Embedded approaches: feature selection occurs naturally as part of the data mining
algorithm
o Filter approaches: features are selected before data mining algorithm is run
o Wrapper approaches: use the data mining algorithm as a black box to find best
subset of attributes
- Feature creation
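
The sampling types listed above can be sketched with Python's standard `random` module. The population, the fixed seed and the helper names are assumptions for illustration; drawing the same number of items from every partition is only one possible way to do stratified sampling.

```python
import random
from collections import defaultdict

def sample_without_replacement(population, k, rng):
    # Each selected item is removed from the pool, so no item appears twice.
    return rng.sample(population, k)

def sample_with_replacement(population, k, rng):
    # Items stay in the pool, so the same item can be picked more than once.
    return rng.choices(population, k=k)

def stratified_sample(records, key, k_per_stratum, rng):
    # Split the data into partitions by `key`, then sample randomly in each one.
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(100))

without = sample_without_replacement(population, 10, rng)
with_repl = sample_with_replacement(population, 10, rng)

# Made-up imbalanced data: 90 records of class "a", 10 of class "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
stratified = stratified_sample(records, key=lambda r: r[0], k_per_stratum=5, rng=rng)

print(without)
print(with_repl)
print(stratified)
```

With simple random sampling the rare class "b" could easily be missed in a small sample; the stratified version guarantees it is represented even though the partitions differ in size.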

Stuvia is a marketplace, so you are not buying this document from us, but from seller donjaschipper. Stuvia facilitates payment to the seller.
