Data mining - all lectures

All lectures of the Data mining course, summarized.

  • 19 October 2024
  • 50 pages
  • 2023/2024
  • Lecture notes
  • Marco Loog & Tom Klaassen
  • All lectures
  • Author: donjaschipper
Data mining lecture notes
Lecture 1
What is / is not data mining? Any method that distills [actionable] information or knowledge from data?

NOT

- Look up phone number in phone directory
- Query a web search engine for information about Amazon

IS

- Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest vs. Amazon.com)

Data mining tasks

- Prediction methods
  o Use some variables to predict unknown or future values of other variables
  o Classification
    ▪ Given a collection of records (training set): each record contains a set of attributes, one of which is the class
    ▪ Find a model for the class attribute as a function of the values of the other attributes
    ▪ Goal: previously unseen records should be assigned a class as accurately as possible
    ▪ A test set is used to determine the accuracy of the model
  o Regression
    ▪ Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
  o Time series analysis
  o Prediction
- Description methods
  o Find human-interpretable patterns that describe the data
  o Clustering
    ▪ Given a set of data points, each having the same set of attributes, and a similarity measure among them, find clusters such that
      - Data points in one cluster are more similar to one another
      - Data points in separate clusters are less similar to one another
    ▪ Similarity measures:
      - Euclidean distance if attributes are continuous
      - Other problem-specific measures
  o Summarization
  o Association rules
    ▪ Given a set of records, each of which contains some number of items from a given collection: produce dependency rules that predict the occurrence of an item based on occurrences of other items
  o Sequence discovery
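
The classification task described above (learn from a training set, then measure accuracy on a test set) can be sketched as follows. This is only an illustration: the nearest-neighbour rule, the function names and the toy records are assumptions and do not come from the lecture. Euclidean distance is used as the similarity measure because the attributes are continuous.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(training_set, attributes):
    """Return the class of the training record closest to `attributes`."""
    nearest = min(training_set, key=lambda rec: euclidean(rec[0], attributes))
    return nearest[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(predict_1nn(training_set, attrs) == label
                  for attrs, label in test_set)
    return correct / len(test_set)

# Hypothetical toy records: ((height_cm, weight_kg), class)
training_set = [((150, 50), "small"), ((155, 55), "small"),
                ((180, 80), "large"), ((185, 90), "large")]
test_set = [((152, 53), "small"), ((182, 85), "large"), ((160, 60), "small")]

print(accuracy(training_set, test_set))
```

The test set is kept separate from the training set, so the reported accuracy reflects how the model handles previously unseen records.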

Components of data mining algorithms
- Representation: determining the nature and structure of the representation to be used
- Score function: quantifying and comparing how well different representations fit the data
- Search/optimization method: choosing an algorithmic process to optimize the score function
- Data management: deciding what principles of data management are required to implement
the algorithms efficiently
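
A minimal sketch of how the first three components fit together, assuming a deliberately simple setting that is not from the lecture: the representation is a line through the origin, the score function is the sum of squared errors, and the search method is an exhaustive grid search. The data and all names are hypothetical. Data management is not illustrated because the toy data fits in memory.

```python
# Representation: a line through the origin, y = slope * x (one parameter).
def predict(slope, x):
    return slope * x

# Score function: sum of squared errors between predictions and observations.
def score(slope, data):
    return sum((predict(slope, x) - y) ** 2 for x, y in data)

# Search/optimization: try every candidate slope and keep the best-scoring one.
def search(data, candidates):
    return min(candidates, key=lambda s: score(s, data))

data = [(1, 2.1), (2, 3.9), (3, 6.2)]        # hypothetical (x, y) observations
candidates = [i / 10 for i in range(0, 51)]  # slopes 0.0, 0.1, ..., 5.0

best = search(data, candidates)
print(best, score(best, data))
```

Swapping any one component (for example, a different score function) changes the algorithm while leaving the other two untouched.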

Challenges of data mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy
- Accountability
- fairness

Lecture 2
Data is a collection of objects and their attributes. An attribute is a property or
characteristic of an object. A collection of attributes describes an object.
Attribute values are numbers or symbols assigned to an attribute. Distinction
between attributes and attribute values:

- Same attribute can be mapped to different attribute values
- Different attributes can be mapped to the same set of values

Mathematical properties / operations:

- Distinctness: = ≠
- Order: < >
- Addition: + -
- Multiplication: * /

The type of an attribute depends on which of these apply

- Nominal attribute: distinctness
- Ordinal attribute: distinctness & order
- Interval attribute: distinctness, order & addition
- Ratio attribute: all 4 properties
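
The mapping from supported operations to attribute type can be written down as a small lookup. This is a sketch only: the dictionary, the function name and the example attributes in the comments are assumptions and are not taken from the lecture.

```python
# Operations that are meaningful for each attribute type, as listed above.
OPERATIONS = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}

def attribute_type(supported):
    """Return the attribute type whose operation set equals `supported`."""
    for name, ops in OPERATIONS.items():
        if ops == set(supported):
            return name
    raise ValueError(f"no attribute type supports exactly {sorted(supported)}")

# Hypothetical examples: eye colour, a grade scale, temperature in Celsius.
print(attribute_type({"distinctness"}))
print(attribute_type({"distinctness", "order"}))
print(attribute_type({"distinctness", "order", "addition"}))
```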

A discrete attribute has only a finite or countably infinite set of values and is often represented as an integer
variable. A continuous attribute has real numbers as attribute values and is typically represented as a
floating-point variable.

Types of data sets
- Record
  o Data that consists of a collection of records, each of which consists of a fixed set of attributes
- Data matrix
  o If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
  o Such a data set can be represented by an m by n matrix, with m rows, one for each object, and n columns, one for each attribute
- Document data
  o Each document becomes a “term” vector
    ▪ Each term is a component [attribute] of the vector
    ▪ The value of each component is the number of times the corresponding term occurs in the document
- Transaction data
  o Special type of record data where each record [transaction] involves a set of items
- Graph
  o World Wide Web
  o Molecular structures
- Ordered
  o Spatial data
  o Temporal data
  o Sequential data
  o Genetic sequence data
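
The document-data representation above can be sketched as follows: each document becomes a vector of term counts, and stacking the vectors gives an m by n data matrix (m documents, n terms). The whitespace tokenisation, the function name and the example documents are assumptions made for illustration.

```python
from collections import Counter

def term_vectors(documents):
    """Turn each document into a vector of term counts over a shared vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # One row per document, one column per term; missing terms count as 0.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

documents = ["the team won the game", "the game was lost"]  # hypothetical
vocabulary, matrix = term_vectors(documents)

print(vocabulary)
for row in matrix:
    print(row)
```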

Data quality
Data are of high quality if they

- Are fit for their intended use
- Correctly represent the phenomena they correspond to

Data quality problems

- Noise
  o Modification of original values
- Outliers
  o Data objects with characteristics that are considerably different from most [or even any?] of the other data objects in the data set
- Missing values
  o Reasons
    ▪ Information is not collected
    ▪ Attributes may not be applicable to all cases
  o Handling missing values
    ▪ Eliminate data objects
    ▪ Estimate missing values
    ▪ Ignore the missing value during analysis
    ▪ Replace with all possible values [weighted by their probabilities]
- Duplicate data
  o Data set may include data objects that are duplicates [or almost duplicates] of one another
  o Results in a need for data cleaning
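
Two of the strategies for missing values (eliminating data objects and estimating the missing value) and the removal of exact duplicates can be sketched like this. The records, the use of `None` as the missing-value marker and the choice of the mean as the estimate are assumptions, not something the lecture prescribes. Near-duplicates would need a similarity measure and are not handled here.

```python
from statistics import mean

records = [  # hypothetical records; None marks a missing value
    {"age": 23, "income": 30000},
    {"age": None, "income": 42000},
    {"age": 31, "income": 52000},
    {"age": 31, "income": 52000},  # exact duplicate of the previous record
]

def eliminate(records, attribute):
    """Eliminate data objects whose value for `attribute` is missing."""
    return [r for r in records if r[attribute] is not None]

def estimate_with_mean(records, attribute):
    """Estimate missing values with the mean of the observed values."""
    fill = mean(r[attribute] for r in records if r[attribute] is not None)
    return [{**r, attribute: fill if r[attribute] is None else r[attribute]}
            for r in records]

def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

print(len(eliminate(records, "age")))
print(estimate_with_mean(records, "age")[1])
print(len(drop_duplicates(records)))
```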

Data preprocessing

- Aggregation
  o Combining multiple attributes [or objects] into a single attribute [or object]
  o Purpose
    ▪ Data reduction
    ▪ Change of scale
    ▪ More “stable” data
- Sampling
  o Main technique employed for data selection (used for both preliminary investigation of the data and the final analysis)
  o Statisticians sample because obtaining all data of interest is too expensive or time consuming
  o Sampling is used in data mining because processing all data of interest is too expensive or time consuming
  o Key principle
    ▪ Using a sample will work almost as well as using the entire data set, if the sample is representative
    ▪ A sample is representative if it has approximately the same property [of interest] as the original data
  o Types of sampling
    ▪ Simple random sampling: equal probability of selecting any particular item
    ▪ Sampling without replacement: as each item is selected, it is removed from the population
    ▪ Sampling with replacement: objects are not removed from the population as they are selected for the sample (the same object can be picked more than once)
    ▪ Stratified sampling: split the data into several partitions; then draw random samples from each partition (the partitions do not have to be the same size)
- Dimensionality reduction
  o Purpose
    ▪ Avoid the curse of dimensionality: when dimensionality increases, data becomes increasingly sparse in the space that it occupies
    ▪ Reduce the amount of time and memory required by data mining algorithms
    ▪ Allow data to be more easily visualized
    ▪ May help to eliminate irrelevant features or reduce noise
  o Techniques
    ▪ Principal Component Analysis
    ▪ Singular Value Decomposition
    ▪ Others: supervised and non-linear techniques
- Feature subset selection
  o Brute-force approach: try all possible feature subsets as input to the data mining algorithm
  o Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
  o Filter approaches: features are selected before the data mining algorithm is run
  o Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes
- Feature creation
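
The sampling types listed above can be sketched with Python's standard `random` module. The function names, the toy population and the even/odd partitioning are assumptions made for illustration; drawing the same number of items from every partition is just one possible choice for stratified sampling.

```python
import random

def sample_without_replacement(population, k, rng):
    """Each selected item is removed from the pool, so no item repeats."""
    return rng.sample(population, k)

def sample_with_replacement(population, k, rng):
    """Items stay in the pool, so the same item can be picked more than once."""
    return [rng.choice(population) for _ in range(k)]

def stratified_sample(population, stratum_of, k_per_stratum, rng):
    """Split the population into partitions, then sample from each partition."""
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    return {name: rng.sample(items, min(k_per_stratum, len(items)))
            for name, items in strata.items()}

rng = random.Random(0)         # fixed seed so the sketch is reproducible
population = list(range(100))  # hypothetical population of 100 objects

without = sample_without_replacement(population, 10, rng)
with_repl = sample_with_replacement(population, 10, rng)
by_parity = stratified_sample(
    population, lambda x: "even" if x % 2 == 0 else "odd", 5, rng)

print(without)
print(with_repl)
print(by_parity)
```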
