Samenvatting Data Mining and its Applications
Week 1
- Lecture 1
Data mining is the extraction of interesting information or patterns from large data sources,
which may originally have been developed for other purposes, employing machine and
statistical learning and possibly high-end computational power, in order to serve business
purposes.
Data mining examples: Risk assessment, demand forecasting, fraud detection, anomaly
detection.
From data → knowledge
Data can be at rest, on the move or in use.
There are several data mining stakeholders:
● Business user: business understanding
● Project sponsor: project driver
● Project manager: end-to-end project delivery
● Business intelligence analyst: data understanding
● Data administrator & integrator: data preparation & solution delivery
● Data scientist / engineer: data modelling and evaluation
Data mining project workflow:
Inception and discovery → Data preparation → Model planning → Model building→
Communicate results → Operationalise
ETL: Extraction, Transformation, Loading
The goal of the data understanding phase is to gain general insights about the data that will
potentially be helpful for further steps in the data analysis process. Never trust data until you
have carried out some simple plausibility checks.
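A minimal sketch of such plausibility checks in Python with pandas, assuming a hypothetical file sales.csv with columns "age" and "price":

import pandas as pd

df = pd.read_csv("sales.csv")

print(df.shape)               # number of instances and attributes
print(df.dtypes)              # inferred attribute types
print(df.describe())          # ranges and means, to spot impossible values
print(df.isna().sum())        # missing values per attribute
print((df["age"] < 0).sum())  # values outside the plausible domain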
Attributes: features, variables
Instances: records, data objects, entries
Data can usually be described in terms of tables or matrices
Attributes differ in their scale type, according to the type of values they can assume.
Three scale types: • Categorical / Nominal • Ordinal • Numeric
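A small sketch of how the three scale types can be represented in a pandas table, using made-up example data:

import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "blue", "red"],                # categorical / nominal
    "size": pd.Categorical(["S", "M", "L"],
                           categories=["S", "M", "L"],
                           ordered=True),            # ordinal
    "weight": [1.2, 3.4, 2.8],                       # numeric
})
print(df.dtypes)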
Granularity, the state of being composed of grains or granules, refers to the extent to which a
material or system is made up of distinguishable pieces.
Some attributes have a fixed domain (months), some change over time (products in a catalog)
Data quality issues: Availability, usability, reliability, relevance, presentation quality.
Accuracy is defined as the closeness between the value in the data and the true value
→ Syntactic: the value might not be correct, but it at least belongs to the domain of the
corresponding attribute.
→ Semantic: the value is in the domain of the corresponding attribute, but it is not correct.
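A small illustration of the two notions of accuracy, using a made-up "month" attribute whose true value is assumed to be "Mar":

VALID_MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

value_a = "Jann"  # syntactically inaccurate: not in the domain at all
value_b = "Feb"   # semantically inaccurate: valid month, but not the true value

def syntactically_accurate(value):
    return value in VALID_MONTHS

print(syntactically_accurate(value_a))  # False
print(syntactically_accurate(value_b))  # True, although the value is still wrong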
Data quality issues: completeness
Visualisation charts: Comparison, time series, correlation, value distribution
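A minimal sketch that maps these four chart categories onto matplotlib plots, using randomly generated placeholder data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].bar(["A", "B", "C"], [3, 7, 5])        # comparison
axes[0, 0].set_title("Comparison")
axes[0, 1].plot(np.cumsum(rng.normal(size=100)))  # time series
axes[0, 1].set_title("Time series")
axes[1, 0].scatter(x, y, s=10)                    # correlation
axes[1, 0].set_title("Correlation")
axes[1, 1].hist(x, bins=20)                       # value distribution
axes[1, 1].set_title("Value distribution")
plt.tight_layout()
plt.show()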
Chapter 1 - Motivation
Data refer to single instances, describe individual properties, are often available in large
amounts, are usually easy to collect or obtain, and do not allow us to make predictions.
Knowledge refers to classes of instances, describes general patterns, structures, laws, etc.,
consists of as few statements as possible, is often difficult and time-consuming to find or
obtain, and allows us to make predictions and forecasts.
Criteria to assess knowledge:
- Correctness
- Generality
- Usefulness
- Comprehensibility
- Novelty
Descriptive statistics summarises data without making specific assumptions about the data.
Inferential statistics provides more rigorous methods than descriptive statistics; these methods
are based on certain assumptions about the random process that generates the data.
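A short sketch contrasting the two, with made-up sample data and SciPy:

import numpy as np
from scipy import stats

sample = np.array([4.8, 5.1, 5.4, 4.9, 5.2, 5.0])

# Descriptive: summarise the sample itself, without assumptions about its origin.
print(sample.mean(), sample.std(ddof=1))

# Inferential: assume the data come from a normal distribution and test
# whether the underlying mean could plausibly be 5.0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)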
In an experimental study one can control and manipulate the data generating process.
In an observational study one cannot control the data generating process.
Exploratory data analysis is concerned with generating hypotheses from the collected data.
Data science, the opportunity of analysing large real world data repositories that were initially
collected for different purposes that came with the availability of powerful tools and technologies
that can process and analyse massive amounts of data.
CRISP-DM (Cross-Industry Standard Process for Data Mining): Business understanding → Data
understanding → Data preparation → Modelling → Evaluation → Deployment
Problem categories (illustrated in the sketch after this list):
- Classification, predict the outcome of an experiment with a finite number of possible
results.
- Regression, a prediction task with a numerical value of interest.
- Clustering, summarise the data to get a better overview by forming groups of similar
cases.
- Association analysis, find any correlations or associations to better understand or
describe the interdependencies of all attributes.
- Deviation analysis, knowing already the major trends or structures, find any exceptional
subgroup that behaves differently with respect to some target attribute.
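A minimal sketch of how the first three problem categories map onto scikit-learn estimators, using its bundled iris data set (the feature and target choices are only for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: predict one of a finite number of classes (the iris species).
clf = DecisionTreeClassifier().fit(X, y)

# Regression: predict a numerical value (here petal width from the other features).
reg = LinearRegression().fit(X[:, :3], X[:, 3])

# Clustering: group similar cases without using the class labels.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)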
Chapter 2 - Practical data science: an example
An example is described using both a naive and a sound approach.
Chapter 3 - Project understanding
Determine the project objective: objective, deliverable, success criteria
Assess the situation: assess resources, clarify access, evaluate assumptions and risks, and
verify the suitability of the data for the project, to avoid wasting resources on potentially
unsuccessful endeavours.
Determine analysis goals: It is crucial to carefully consider the limitations and practical
implications of the chosen architecture to ensure that the developed model aligns with the
intended use and produces valuable results.
Desirable properties: Interpretability, reproducibility, model flexibility, runtime, interestingness
Chapter 4 - Data understanding
Domain is the set of possible values for an attribute.
Scale type: nominal, ordinal, numeric
Granularity is the level of refinement chosen.
Data quality refers to how well the data fit their intended use.
- Accuracy is defined as the closeness between the value in the data and the true value.
- Syntactic accuracy means that a considered value might not be correct, but it belongs at
least to the domain of the corresponding attribute.