Data Mining For Business And Governance (880022M6)
Detailed summary of all lectures and additional notes, explanations and examples for the course "Data Mining for Business and Governance" at Tilburg University which is part of the Master Data Science and Society. Course was given by Ç. Güven, G.R. Nápoles during the second semester, block three...
Seller: hannahgruber
Tilburg University
Study Program: Master Data Science and Society
Academic Year 2021/2022, Semester 2, Block 3 (January to March 2022)
Course: Data Mining for Business and Governance (880022-M-6)
Lecturers: Ç. Güven, G.R. Nápoles
Introduction to Data Mining
• no fixed definition, umbrella term
o Knowledge discovery in databases, Statistics, Artificial Intelligence, Machine learning
• Computation vs large data sets: trade-off between processing time and memory
o the larger the dataset, the more computational resources are needed
• Large amounts or big data: Volume, Variety, Velocity
Pipeline of a data mining task
Basic data types
• Dependency oriented: explicit or implicit relationships between records
• Non-dependency oriented: no specified dependency between records (multidimensional data)
• For many machine learning models, observations are assumed to be independent
What makes prediction possible?
• Associations between features/target, understand how datapoints are related
• Numerical: correlation coefficient
• Categorical: mutual information (the value of x1 contains information about the value of x2)
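Mutual information between two categorical features can be sketched as follows. This is a minimal illustration, not the course's implementation: the function name `mutual_information` and the toy data (`weather`, `umbrella`, `coin`) are assumptions made up for this example. The result is in bits and is estimated from observed frequencies.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) simplifies to c * n / (count_x[x] * count_y[y])
        mi += p_xy * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data
weather = ["sun", "sun", "rain", "rain"]
umbrella = ["no", "no", "yes", "yes"]      # fully determined by weather
coin = ["heads", "tails", "heads", "tails"]  # unrelated to weather

print(mutual_information(weather, umbrella))  # 1.0
print(mutual_information(weather, coin))      # 0.0
```

Knowing the weather removes all uncertainty about the umbrella (1 bit), but says nothing about the coin (0 bits).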
Correlation coefficient
• Pearson's r/R measures the strength of a linear relationship (dependency); it does not capture other shapes
• range [-1, 1]: the closer r is to 0, the more dispersed the data are around a straight line; 0 = no linear relationship, +1/-1 = perfect positive/negative linear relationship
• for a strong linear relationship between two features, one of the features can be linearly
expressed in terms of the other, which makes one of them redundant in the analysis
• $r = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}$
• Numerator: covariance (to what extent the features change together)
• Denominator: product of standard deviations (makes correlations independent of units)
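The numerator and denominator described above can be computed directly. The sketch below is only an illustration under assumptions: the function name `pearson_r` and the toy data are made up for this example, and population (divide-by-n) formulas are used for both covariance and standard deviation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: to what extent the two features change together
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Denominator: product of standard deviations (removes the units)
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


# Hypothetical toy data
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [2.0, 4.0, 6.0, 8.0, 10.0]
minutes = [h * 60 for h in hours]

r_hours = pearson_r(hours, score)      # approximately 1 (perfect linear relationship)
r_minutes = pearson_r(minutes, score)  # approximately the same: units do not matter

# A perfect but non-linear (quadratic) relationship gives r = 0
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_parabola = pearson_r(xs, [x ** 2 for x in xs])

print(r_hours, r_minutes, r_parabola)
```

The last case shows why r only measures linear dependency: y is fully determined by x, yet the covariance is zero.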
Correlation versus causation
• Correlation does not imply causation
• a correlation can be a coincidence or be caused by a third (confounding) variable
• explain and check causation in an experimental study
o vary a single variable while the others are kept equal
Supervised learning
• use labeled data to train the algorithm
• classification and regression problems
Learning workflow
• 1) collect data
o consider reliability of measurement, privacy, and other regulations
o split data into training, validation, and test set with similar structure
▪ training set for learning
▪ validation set for tuning and setting hyperparameters
▪ test set for final evaluation
• 2) label examples (sometimes part of data collection)
o Annotation guidelines, Measure inter-annotator agreement, Crowdsourcing
• 3) choose representation (part of preprocessing)
o Features: attributes describing examples
o Observations: observed values for a given attribute
▪ numerical features: discrete or continuous
▪ categorical / nominal features, binary features
▪ ordinal features (scale)
o features can be converted to a vector
o ‘feature transformation’: e.g., use dummy coding to transform a categorical feature
to a numerical one
o ‘feature extraction’: select relevant features which represent the input and define
the output
• 4) train model(s)
o hyperparameters: settings of an algorithm chosen by the programmer before training (not learned from the data)
▪ for each candidate value of a hyperparameter:
1) Apply algorithm to training set to learn
2) Check performance on validation set
▪ then choose the best-performing setting
• 5) evaluate
o Check performance of tuned model on test set
o Goal: estimate how well your model will do in the real world (generalization)
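The workflow above (split, tune on the validation set, evaluate once on the test set) can be sketched end to end. Everything here is an assumption for illustration: the data are synthetic, the 60/20/20 split is one common choice, and a tiny k-nearest-neighbours classifier stands in for "the algorithm", with k as the hyperparameter.

```python
import random
from collections import Counter


def knn_predict(train, point, k):
    """Predict the majority label among the k nearest training examples."""
    nearest = sorted(
        train,
        key=lambda ex: (ex[0][0] - point[0]) ** 2 + (ex[0][1] - point[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, examples, k):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in examples)
    return hits / len(examples)


# 1) + 2) collect and label data (hypothetical: two noisy 2-D clusters)
rng = random.Random(0)
data = []
for _ in range(150):
    label = rng.choice([0, 1])
    center = 0.0 if label == 0 else 2.0
    data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))

# Shuffle, then split into training / validation / test (60% / 20% / 20%)
rng.shuffle(data)
train, validation, test = data[:90], data[90:120], data[120:]

# 4) train and tune: try each hyperparameter value, score it on the validation set
candidates = [1, 3, 5, 7, 9]  # odd k avoids ties between the two classes
val_scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(val_scores, key=val_scores.get)

# 5) evaluate: the test set is used once, only for the tuned model
test_accuracy = accuracy(train, test, best_k)
print(best_k, val_scores[best_k], test_accuracy)
```

The test set plays no role in choosing k, so the test accuracy is an estimate of how the tuned model generalizes to unseen data.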
Regression task: predicting a numeric quantity
• regression analysis describes the relationship between random variables
• it can predict the value of one variable based on another variable and show trends
• output of a regression problem is a function describing the relation between x and y
• unlike classification, regression predicts values of continuous variables (numerical prediction)
Linear regression
• simplest regression technique with two types of variables: the independent variable(s) x and the dependent variable y
• aim is to minimize the difference between the predicted and the actual values
• ways to measure this difference
o sum of squared errors
o or other loss functions
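For a single input variable, the line that minimizes the sum of squared errors has a closed-form solution. The sketch below illustrates this under assumptions: the function names `fit_line` and `sum_squared_errors` and the toy data are made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
sse_fit = sum_squared_errors(xs, ys, slope, intercept)
sse_guess = sum_squared_errors(xs, ys, 2.0, 0.0)  # a hand-picked alternative line

print(slope, intercept)    # roughly 1.99 and 0.05
print(sse_fit, sse_guess)  # the fitted line has the smaller error
```

Any other line, such as the hand-picked y = 2x, gives a sum of squared errors at least as large as the fitted one.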