Summary data mining and its applications (including book and lectures)
Course
Data Mining and its Applications (EBB056B05)
Institution
Rijksuniversiteit Groningen (RuG)
Book
Guide to Intelligent Data Science
Summary of the subject data mining and its applications at Rijksuniversiteit Groningen. Year 2 of bedrijfskunde / pre-master. Summary of the relevant chapters of the book and the lecture slides. You can also bring this to the exam.
Written for
Rijksuniversiteit Groningen (RuG)
Bedrijfskunde: Technology Management
Data Mining and its Applications (EBB056B05)
Seller
karlijnheikens54
Data Mining and its Applications
Week 1 - Chapters 1, 2, 3 and 4
Chapter 1 - Introduction
Data science: The goal of this field is to develop tools that help
humans find potentially useful patterns in their data and solve the problems they are
facing by making better use of the data they have.
Data
- Refer to single instances (single objects, people, events, points in time, etc.)
- Describe individual properties
- Are often available in large amounts (databases, archives)
- Are often easy to collect or to obtain (e.g., scanner cashiers in supermarkets, Internet)
- Do not allow us to make predictions or forecasts
Data states:
- Data at rest
- Data on the move
- Data in use
Knowledge
- Refers to classes of instances (sets of objects, people, events, points in time, etc.)
- Describes general patterns, structures, laws, principles, etc.
- Consists of as few statements as possible (this is actually an explicit goal, see below)
- Is often difficult and time-consuming to find or to obtain (e.g., natural laws, education)
- Allows us to make predictions and forecasts
These characterizations make it very clear that knowledge is generally much more valuable
than (raw) data.
Enriching the value of data: Data → Information → Knowledge → Context
Criteria to assess knowledge
- Correctness (probability, success in tests)
- Generality (domain and conditions of validity)
- Usefulness (relevance, predictive power)
- Comprehensibility (simplicity, clarity, parsimony)
- Novelty (previously unknown, unexpected)
Statistics has a long history and originated from collecting and analyzing data about the
population and the state in general. Statistics can be divided into descriptive and inferential
statistics.
- Descriptive statistics summarizes data without making specific assumptions about the
data, often by characteristic values like the (empirical) mean or by diagrams like
histograms.
- Inferential statistics provides more rigorous methods than descriptive statistics that
are based on certain assumptions about the data-generating random process. The
conclusions drawn in inferential statistics are only valid if these assumptions are
satisfied.
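To make the two descriptive tools named above concrete, here is a minimal sketch in Python that computes an empirical mean and a text histogram. The age values, the bin width, and the function name `histogram` are made up for illustration and do not come from the book or the lectures.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: ages of ten customers (made-up values).
ages = [23, 25, 31, 35, 35, 41, 44, 52, 58, 66]
BIN_WIDTH = 10  # assumed bin width, chosen only for this example


def histogram(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


empirical_mean = mean(ages)
age_histogram = histogram(ages, BIN_WIDTH)

print("empirical mean:", empirical_mean)
for start, count in age_histogram.items():
    # One '#' per observation in the bin.
    print(f"{start}-{start + BIN_WIDTH - 1}: {'#' * count}")
```

Neither step assumes anything about how the data were generated, which is exactly what separates descriptive from inferential statistics.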
We distinguish between experimental and observational studies:
- In an experimental study one can control and manipulate the data-generating process.
- In an observational study one cannot control the data-generating process.
Hypothesis testing: based on the collected data, we desire to either confirm or reject some
hypothesis about the considered domain.
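As an illustration of hypothesis testing, the sketch below runs an exact two-sided binomial test of the hypothesis "this coin is fair". The coin example, the observed counts, the 5% significance level, and all names are assumptions chosen for this example; the summary does not prescribe a specific test. Note that, strictly speaking, such a test either rejects the hypothesis or fails to reject it.

```python
from math import comb


def two_sided_binomial_p_value(successes, trials):
    """Exact p-value for the null hypothesis 'success probability is 0.5'.

    Sums the probabilities of all outcomes that lie at least as far from
    the expected count (trials / 2) as the observed count does.
    """
    expected = trials / 2
    observed_distance = abs(successes - expected)
    extreme_outcomes = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - expected) >= observed_distance
    )
    # Under the null hypothesis every sequence of flips has probability 1 / 2**trials.
    return extreme_outcomes / 2 ** trials


ALPHA = 0.05            # assumed significance level
heads, flips = 16, 20   # hypothetical observation

p_value = two_sided_binomial_p_value(heads, flips)
reject_null = p_value < ALPHA

print(f"p-value = {p_value:.4f}, reject 'coin is fair': {reject_null}")
```

The p-value is only meaningful if the assumptions of the test hold (here: independent flips with a constant probability), which mirrors the remark about inferential statistics.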
Exploratory data analysis is concerned with generating hypotheses from the collected data.
Data science: Powerful tools and technologies that can process and analyze massive amounts
of data.
Problem categories
- Classification
Predict the outcome of an experiment with a finite number of possible results (like
yes/no or unacceptable/acceptable/good/very good). We may be interested in a
prediction because the true result will emerge in the future or because it is expensive,
difficult, or cumbersome to determine it.
- Regression
Regression is, just like classification, also a prediction task, but this time the value of
interest is numerical in nature.
- Clustering, segmentation
Summarize the data to get a better overview by forming groups of similar cases (called
clusters or segments). Instead of examining a large number of similar records, we need
to inspect the group summary only. We may also obtain some insight into the structure
of the whole data set. Cases that do not belong to any group may be considered as
abnormal or outliers.
- Association analysis
Find any correlations or associations to better understand or describe the
interdependencies of all the attributes. The focus is on relationships between all attributes
rather than on a single target variable or the cases (full record).
- Deviation analysis
Knowing already the major trends or structures, find any exceptional subgroup that
behaves differently with respect to some target attribute.
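The difference between classification and regression can be sketched with one simple method applied to both tasks. The example below uses k-nearest neighbours, which is a swapped-in technique chosen for brevity and not a method named in this part of the summary. The training records, attribute names, and the choice k = 3 are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical training records: (hours_studied, passed_exam, exam_grade).
records = [
    (1.0, "no", 3.5),
    (2.0, "no", 4.5),
    (3.0, "no", 5.0),
    (5.0, "yes", 6.5),
    (6.0, "yes", 7.0),
    (8.0, "yes", 8.5),
]


def nearest_neighbours(x, k):
    """Return the k training records whose input value is closest to x."""
    return sorted(records, key=lambda r: abs(r[0] - x))[:k]


def classify(x, k=3):
    """Classification: majority vote over a finite set of labels."""
    labels = [label for _, label, _ in nearest_neighbours(x, k)]
    return Counter(labels).most_common(1)[0][0]


def regress(x, k=3):
    """Regression: average of the neighbours' numerical target values."""
    return mean(grade for _, _, grade in nearest_neighbours(x, k))


predicted_label = classify(6.0)   # one of the finite outcomes "yes" / "no"
predicted_grade = regress(6.0)    # a numerical value

print("pass?", predicted_label)
print("grade:", round(predicted_grade, 2))
```

Both functions use the same neighbours; only the type of the target (a label from a finite set versus a number) decides whether the task is classification or regression.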
Catalog of Methods
- Finding patterns
If the domain (and therefore the data) is new to us or if we expect to find interesting
relationships, we explore the data for new, previously unknown patterns. We want to
get a full picture and do not concentrate on a single target attribute, yet. We may apply
methods from, for instance, segmentation, clustering, association analysis, or deviation
analysis.
- Finding explanations
We have a special interest in some target variable and wonder why and how it varies
from case to case. The primary goal is to gain new insights (knowledge) that may
influence our decision making, but we do not necessarily intend automation. We may
apply methods from, for instance, classification, regression, association analysis, or
deviation analysis.
- Finding predictors
We have a special interest in the prediction of some target variable, but it (possibly)
represents only one building block of our full problem, so we do not really care about
the how and why but are just interested in the best-possible prediction. We may apply
methods from, for instance, classification or regression.
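For the "finding patterns" task, association analysis is one of the methods listed above. A minimal sketch, assuming the commonly used measures support and confidence (these measures are not defined in this part of the summary) and a handful of made-up supermarket baskets:

```python
# Hypothetical supermarket baskets (made-up data for illustration).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


# Candidate rule: "customers who buy bread also buy butter".
rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})

print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

No target attribute is singled out here: any combination of items can be examined, which is what distinguishes pattern finding from the explanation and prediction tasks.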
Chapter 3: Project understanding
Project understanding: In this initial phase of the data analysis project, we have to map a
problem onto one or many data analysis tasks. The project understanding phase should be
carried out with care to keep the project on the right track.
Problem source | Project owner perspective | Analyst perspective
Communication | Project owner does not understand the technical terms of the analyst | Analyst does not understand the terms of the domain of the project owner
Lack of understanding | Project owner was not sure what the analyst could do or achieve; models of the analyst were different from what the project owner envisioned | Analyst found it hard to understand how to help the project owner
Organization | Requirements had to be adapted in later stages as problems with the data became evident | Project owner was an unpredictable group (not so concerned with the project)
Data mining stakeholders
- Business User: business understanding
Has a sound understanding of the business domain targeted by the data mining project.
This person can offer insight into the project context and the business value sought to be
extracted via data mining, and can advise on how results can be operationalized. A Business
Analyst and/or a Line Manager might be suitable for such a role.
- Project Sponsor: project driver
In most cases the initiator or driver of the data mining project. Concerned with the
potential Return On Investment (ROI) and sets priorities and desired outputs. This
person champions the project, motivating engagement of key personnel around
the business problem.
- Project Manager: end-to-end project delivery
This person is in charge of the data mining project implementation and is concerned
with meeting quality, time, and budget targets.
- Business Intelligence Analyst: data understanding
This person acts as the bridge between the data and the business view of the targeted
problem. Maintaining a sound understanding of relevant data, the Business
Intelligence Analyst drives activities related to Key Performance Indicators (KPIs) and
extracts relevant data for reporting and dashboarding purposes. Understands
sources and 'consumers' of data, as well as the need for changes in data management
processes.
- Data Administrator & Integrator: data preparation & solution delivery
Provides action support for implementing key data access and processing activities
needed by stakeholders of the data mining project. A technical person with sound data
management competences, including awareness of security and/or privacy concerns,
would be appropriate.