Lecture notes

Data mining - all lectures

Author: donjaschipper


Course
Data Mining (NWIIBI008)

Institution
Radboud Universiteit Nijmegen (RU)

All data mining lectures summarized.


Uploaded on 19 October 2024
Number of pages: 50
Written in 2023/2024
Type: Lecture notes
Lecturer(s): Marco Loog & Tom Klaassen
Contains: All lectures


Data mining lecture notes
Lecture 1
What is and is not data mining? A working definition: any method that distills [actionable]
information or knowledge from data.

NOT

- Look up phone number in phone directory
- Query a web search engine for information about Amazon

IS

- Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in
the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g.
Amazon rainforest vs. Amazon.com)

Data mining tasks

- Prediction methods
  o Use some variables to predict unknown or future values of other variables
  o Classification
    ▪ Given a collection of records (training set): each record contains a set of attributes, one of which is the class
    ▪ Find a model for the class attribute as a function of the values of the other attributes
    ▪ Goal: previously unseen records should be assigned a class as accurately as possible
    ▪ A test set is used to determine the accuracy of the model
  o Regression
    ▪ Predict the value of a continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
  o Time series analysis
  o Prediction
- Description methods
  o Find human-interpretable patterns that describe the data
  o Clustering
    ▪ Given a set of data points, each having the same set of attributes, and a similarity measure among them, find clusters such that
      - Data points in one cluster are more similar to one another
      - Data points in separate clusters are less similar to one another
    ▪ Similarity measures:
      - Euclidean distance if attributes are continuous
      - Other problem-specific measures
  o Summarization
  o Association rules
    ▪ Given a set of records, each of which contains some number of items from a given collection: produce dependency rules that predict the occurrence of an item based on occurrences of other items
  o Sequence discovery

Components of data mining algorithms
- Representation: determining the nature and structure of the representation to be used
- Score function: quantifying and comparing how well different representations fit the data
- Search/optimization method: choosing an algorithmic process to optimize the score function
- Data management: deciding what principles of data management are required to implement
the algorithms efficiently
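
To make the first three components concrete, here is a minimal sketch under assumed choices: the representation is a line through the origin, the score function is the sum of squared errors, and the search method is an exhaustive search over a fixed grid of slopes. The data and the grid are hypothetical, and real algorithms usually use far better optimizers.

```python
def predict(w, x):
    """Representation: a line through the origin, y = w * x."""
    return w * x


def score(w, data):
    """Score function: sum of squared errors (lower is better)."""
    return sum((y - predict(w, x)) ** 2 for x, y in data)


def search(data, candidates):
    """Search method: try every candidate slope and keep the best-scoring one."""
    return min(candidates, key=lambda w: score(w, data))


# Hypothetical (x, y) observations and candidate slopes 0.0, 0.1, ..., 5.0.
data = [(1, 2.1), (2, 3.9), (3, 6.2)]
candidates = [i / 10 for i in range(0, 51)]

best_w = search(data, candidates)
print(best_w)  # 2.0
```

Swapping any one component (a different model family, a different error measure, or gradient descent instead of grid search) leaves the other two unchanged, which is the point of separating them.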

Challenges of data mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy
- Accountability
- Fairness

Lecture 2
Data is a collection of objects and their attributes. An attribute is a property or
characteristic of an object. A collection of attributes describes an object.
Attribute values are numbers or symbols assigned to an attribute. Distinction
between attributes and attribute values:

- Same attribute can be mapped to different attribute values
- Different attributes can be mapped to the same set of values

Mathematical properties/ operations:

- Distinctness: = ≠
- Order: < >
- Addition: + -
- Multiplication: * /

The type of an attribute depends on which of these apply

- Nominal attribute: distinctness
- Ordinal attribute: distinctness & order
- Interval attribute: distinctness, order & addition
- Ratio attribute: all 4 properties
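
The hierarchy above can be expressed as a small lookup, sketched here in Python. The function names and the example attributes (eye colour, grade, temperature in Celsius, length) are illustrative assumptions and do not come from the lecture.

```python
# Properties in the order they are added when moving from nominal to ratio.
OPERATIONS = ["distinctness", "order", "addition", "multiplication"]
LEVELS = ["nominal", "ordinal", "interval", "ratio"]


def supported_operations(attribute_type):
    """Each attribute type supports its own property plus all earlier ones."""
    level = LEVELS.index(attribute_type)
    return OPERATIONS[: level + 1]


def is_meaningful(attribute_type, operation):
    """True if `operation` is meaningful for values of `attribute_type`."""
    return operation in supported_operations(attribute_type)


# Hypothetical attributes with commonly assumed types.
attributes = {
    "eye_colour": "nominal",
    "grade": "ordinal",
    "temperature_celsius": "interval",
    "length_m": "ratio",
}

# Differences of Celsius temperatures make sense, ratios do not.
print(is_meaningful(attributes["temperature_celsius"], "addition"))        # True
print(is_meaningful(attributes["temperature_celsius"], "multiplication"))  # False
```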

A discrete attribute has only a finite or countably infinite set of values. Often represented as integer
variables. A continuous attribute has real numbers as attribute values. Typically represented as
floating-point variables.

Types of data sets
- Record
  o Data that consists of a collection of records, each of which consists of a fixed set of attributes
- Data matrix
  o If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
  o Such a data set can be represented by an m by n matrix, with m rows, one for each object, and n columns, one for each attribute
- Document data
  o Each document becomes a “term” vector
    ▪ Each term is a component [attribute] of the vector
    ▪ The value of each component is the number of times the corresponding term occurs in the document
- Transaction data
  o Special type of record data where each record [transaction] involves a set of items
- Graph
  o World Wide Web
  o Molecular structures
- Ordered
  o Spatial data
  o Temporal data
  o Sequential data
  o Genetic sequence data
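
As a sketch of how document data becomes a data matrix, the code below turns two hypothetical documents into term vectors. Splitting on whitespace and lowercasing are simplifying assumptions; real pipelines typically also handle punctuation, stop words, and stemming.

```python
from collections import Counter


def term_vectors(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical documents.
documents = ["data mining finds patterns in data", "mining patterns"]
vocabulary, matrix = term_vectors(documents)

print(vocabulary)  # ['data', 'finds', 'in', 'mining', 'patterns']
print(matrix)      # [[2, 1, 1, 1, 1], [0, 0, 0, 1, 1]]
```

The result is an m by n matrix with m = 2 documents (rows) and n = 5 terms (columns).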

Data quality
Data are of high quality if they

- Are fit for their intended use
- Correctly represent the phenomena they correspond to

Data quality problems

- Noise
  o Modification of original values
- Outliers
  o Data objects with characteristics that are considerably different from most [or even any?] of the other data objects in the data set
- Missing values
  o Reasons
    ▪ Information is not collected
    ▪ Attributes may not be applicable to all cases
  o Handling missing values
    ▪ Eliminate data objects
    ▪ Estimate missing values
    ▪ Ignore the missing value during analysis
    ▪ Replace with all possible values [weighted by their probabilities]
- Duplicate data
  o Data set may include data objects that are duplicates [or almost duplicates] of one another
  o Results in a need for data cleaning
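
Two of the strategies for handling missing values listed above, eliminating data objects and estimating missing values, can be sketched as follows. Representing a missing value as `None` and estimating it with the column mean are assumptions for this example; the notes do not say which estimator to use.

```python
def drop_incomplete(records):
    """Eliminate data objects that have at least one missing value."""
    return [r for r in records if None not in r]


def impute_mean(records):
    """Estimate each missing value with the mean of the observed values
    in the same column (assumes every column has at least one observed value)."""
    means = []
    for column in zip(*records):
        observed = [v for v in column if v is not None]
        means.append(sum(observed) / len(observed))
    return [
        tuple(m if v is None else v for v, m in zip(record, means))
        for record in records
    ]


# Hypothetical records with two numeric attributes; None marks a missing value.
records = [(1.0, 10.0), (2.0, None), (None, 30.0), (3.0, 20.0)]

print(drop_incomplete(records))  # [(1.0, 10.0), (3.0, 20.0)]
print(impute_mean(records))      # [(1.0, 10.0), (2.0, 20.0), (2.0, 30.0), (3.0, 20.0)]
```

Dropping objects halves this small data set, while imputation keeps every object at the cost of introducing estimated values.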

Data preprocessing

- Aggregation
  o Combining multiple attributes [or objects] into a single attribute [or object]
  o Purpose
    ▪ Data reduction
    ▪ Change of scale
    ▪ More “stable” data
- Sampling
  o Main technique employed for data selection (used for both the preliminary investigation of the data and the final analysis)
  o Statisticians sample because obtaining all data of interest is too expensive or time-consuming
  o Sampling is used in data mining because processing all data of interest is too expensive or time-consuming
  o Key principle
    ▪ Using a sample will work almost as well as using the entire data set, if the sample is representative
    ▪ A sample is representative if it has approximately the same property [of interest] as the original data
  o Types of sampling
    ▪ Simple random sampling: equal probability of selecting any particular item
    ▪ Sampling without replacement: as each item is selected, it is removed from the population
    ▪ Sampling with replacement: objects are not removed from the population as they are selected for the sample (the same object can be picked more than once)
    ▪ Stratified sampling: split the data into several partitions, then draw random samples from each partition (the partitions do not have to be the same size)
- Dimensionality reduction
  o Purpose
    ▪ Avoid the curse of dimensionality: when dimensionality increases, data becomes increasingly sparse in the space that it occupies
    ▪ Reduce the amount of time and memory required by data mining algorithms
    ▪ Allow data to be more easily visualized
    ▪ May help to eliminate irrelevant features or reduce noise
  o Techniques
    ▪ Principal Component Analysis
    ▪ Singular Value Decomposition
    ▪ Others: supervised and non-linear techniques
- Feature subset selection
  o Brute-force approach: try all possible feature subsets as input to the data mining algorithm
  o Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
  o Filter approaches: features are selected before the data mining algorithm is run
  o Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes
- Feature creation
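
The sampling types listed above can be sketched with Python's standard `random` module. The population, the stratum key, and the function names are hypothetical; drawing a fixed number of items per stratum is one possible design, and proportional allocation would be another.

```python
import random


def sample_without_replacement(population, n, rng):
    """Each selected item is removed from the pool, so no item repeats."""
    return rng.sample(population, n)


def sample_with_replacement(population, n, rng):
    """Items stay in the pool, so the same item can be picked more than once."""
    return rng.choices(population, k=n)


def stratified_sample(population, key, n_per_stratum, rng):
    """Partition the data by `key`, then sample randomly within each partition."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical imbalanced population: 90 items of group "a", 10 of group "b".
population = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

without = sample_without_replacement(population, 10, rng)
with_repl = sample_with_replacement(population, 10, rng)
stratified = stratified_sample(population, key=lambda item: item[0],
                               n_per_stratum=5, rng=rng)
```

With simple random sampling the rare group "b" may be missing from a small sample, whereas the stratified sample always contains five items from each group.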
