Summary: Data Mining and its Applications (EBB056B05)
Course
Data Mining and its Applications (EBB056B05)
Institution
Rijksuniversiteit Groningen (RuG)
Book
Guide to Intelligent Data Science
Summary of the Data Mining and its Applications lectures. It covers the slides of all lectures, supplemented with material from the book and explanations from ChatGPT. I scored an 8.5 on the exam using this summary.
Written for
Bedrijfskunde: Technology Management
Seller
donnakartoidjojo
Content preview
Lecture 1
Lecture 2: Regression
  R-squared vs. RMSE
  Linear regression:
  Polynomial regression:
  Regression tree: the algorithm
  Bootstrap AGGregating (Bagging): for each tree/model a training set is generated by sampling uniformly with replacement from the standard training set
  Generalization
    Advantages of 5-Fold Cross-Validation
Lecture 3: Time series analysis
  Seasonal effect:
  Exponential smoothing
  Stationarity
  A seasonal difference is the difference between an observation and the corresponding observation from the previous (seasonal) cycle
  ARIMA Models:
  Sequence segmentation
  Characteristics of a time series
Lecture 4: clustering
  Hierarchical Clustering (Linkage-Based Clustering)
  K-Means Clustering (Model-Based Clustering)
  Density-Based Clustering (DBScan)
    Example:
    Importance of MinPts:
  Clustering Evaluation
  Attribute Weighting
  Prototype & model-based (k-means,... clustering)
  Partitioning; goal: a (disjoint) partitioning into k clusters with minimal costs
  K-means
  Outliers: k-means vs. k-medoids
  Density-based clustering
  Clustering evaluation
Lecture 5: Classifiers; Decision Trees, Model validation
  Decision Trees
  Evaluation measures - Shannon Entropy
  Gain Ratio
  Gini Index
  χ² measure
  Decision Trees - Missing Values
  Pruning
  Reduced Error Pruning
  Pessimistic Pruning
  Model Validation
Lecture 6: Additional topics on Data Mining
Lecture 7: overview
ChatGPT
  Example Usage
    Row Splitter Node
    Partitioning Node
  Practical Example
  How Gain Ratio is Calculated:
  Example Use:
  How Gini Index is Calculated:
  Purpose of the Gini Index:
  Example Use:
  Characteristics of String Variables
  Use in Data Mining
  Handling String Variables
  Example
Lecture 1
What is data mining?
→ The extraction of interesting information or patterns from large data sets, which may originally have been
collected for other purposes.
Data states:
● Data at rest
● Data on the move
● Data in use
From data to knowledge:
Data mining project understanding
- What is the primary objective?
- What are the criteria for success?
- These are difficult to define
- Stakeholders involved in the data analysis/mining process speak different languages
Data Mining Stakeholders
● Business User: business understanding
○ Has a sound understanding of the business domain targeted by the data mining project. This
person can offer insight into the project context and the business value to be extracted via
data mining, and can advise on how results can be operationalized.
● Project Sponsor: project driver
○ The initiator or driver of the data mining project. Is concerned with the potential ROI and sets
priorities and desired outputs. This person champions the project and motivates key
personnel to engage with the business problem.
● Project Manager: end-to-end project delivery
○ In charge of implementing the data mining project and concerned with meeting the targets for
quality, time and budget.
● Business Intelligence Analyst: data understanding
○ Bridge between the data and the business view of the targeted problem. Maintaining a sound
understanding of relevant data, the Business Intelligence Analyst drives activities related to
Key Performance Indicators (KPIs) and extracts relevant data for reporting and dashboarding
purposes. Understands the sources and ‘consumers’ of data, as well as the need for changes in data
management processes.
● Data Administrator & Integrator: data preparation & solution delivery
○ Provides practical support for implementing the key data access and processing activities needed
by the stakeholders of the data mining project. A technical person with sound data management
competences, including awareness of security and/or privacy concerns, would be appropriate.
● Data Scientist/Engineer: data modeling & evaluation
○ This person combines data management skills with a sound understanding of data analysis
methods and tools and drives the ingestion of data into the overall data analytics process.
The data scientist is able to communicate the analytics methods to the other stakeholders.
→ The data engineer and the data administrator & integrator work closely together on the technical side of
data mining and share relevant code and documentation.
Data Mining Project Workflow
1. Inception and discovery
a. Tool to sketch beliefs, experiences, known factors
b. How often will a certain product be found in a basket?
2. Data preparation
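The inception question under 1b (how often a certain product is found in a basket) can be answered with a simple relative frequency, often called the support of the product in market basket analysis. Below is a minimal Python sketch, assuming each basket is represented as a set of product names. The function name `product_support` and the toy baskets are hypothetical illustrations, not taken from the lectures.

```python
from typing import Iterable, Set


def product_support(baskets: Iterable[Set[str]], product: str) -> float:
    """Return the fraction of baskets that contain `product`."""
    baskets = list(baskets)
    if not baskets:
        # No baskets observed: define the support as 0 to avoid dividing by zero.
        return 0.0
    hits = sum(product in basket for basket in baskets)
    return hits / len(baskets)


# Hypothetical toy data: four shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

support_bread = product_support(baskets, "bread")
print(support_bread)  # 0.75, since 3 of the 4 baskets contain bread
```

A value like this gives a first, rough belief about the data that can be checked during inception and discovery, before any modelling is done.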