Tilburg University
Study Program: Master Data Science and Society
Academic Year 2021/2022, Semester 2, Block 3 (January to March 2022)
Course: Data Mining for Business and Governance (880022-M-6)
Lecturers: Ç. Güven, G.R. Nápoles
Introduction to Data Mining
• no fixed definition, umbrella term
o Knowledge discovery in databases, Statistics, Artificial Intelligence, Machine learning
• Computation vs large data sets: trade-off between processing time and memory
o the larger the dataset, the more computational resources are needed
• Large amounts or big data: Volume, Variety, Velocity
Pipeline of a data mining task
Basic data types
• Dependency oriented: explicit or implicit relationships
• Non-Dependency oriented: no specified dependency between records (multidimensional
data)
• For many machine learning models, observations are assumed to be independent
What makes prediction possible?
• Associations between features/target, understand how datapoints are related
• Numerical: correlation coefficient
• Categorical: mutual information (the value of x1 contains information about the value of x2)
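As an illustration of the categorical case, the following is a minimal sketch of mutual information estimated from observed frequencies. The function name `mutual_information` and the toy features (`weather`, `umbrella`, `coin`) are assumptions made for this example and do not come from the course material.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)            # marginal counts of x1
    count_y = Counter(ys)            # marginal counts of x2
    count_xy = Counter(zip(xs, ys))  # joint counts of (x1, x2)
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        p_independent = (count_x[x] / n) * (count_y[y] / n)
        mi += p_joint * log2(p_joint / p_independent)
    return mi


# Hypothetical toy data: umbrella use follows the weather, the coin does not.
weather = ["sun", "sun", "rain", "rain"]
umbrella = ["no", "no", "yes", "yes"]
coin = ["h", "t", "h", "t"]

mi_dependent = mutual_information(weather, umbrella)
mi_independent = mutual_information(weather, coin)
print(mi_dependent, mi_independent)
```

A value of 0 means that knowing x1 tells nothing about x2; larger values mean x1 carries more information about x2.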
Correlation coefficient
• Pearson's r measures the strength of a linear relationship (dependency) between two features; it does not capture other shapes
• range [-1, 1]: the closer r is to -1 or 1, the stronger the linear relationship; the closer to 0, the more dispersed the data; 0 = no linear relationship
• for a strong linear relationship between two features, one of the features can be linearly
expressed in terms of the other and that makes one of those redundant in analysis
• $r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}$
• Numerator: covariance (to what extent the features change together)
• Denominator: product of standard deviations (makes correlations independent of units)
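The numerator and denominator described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming population (divide by n) covariance and standard deviations; the function name `pearson_r` and the toy measurements are invented for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: to what extent the two features change together.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Denominator: removes the units of both features.
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


# Hypothetical toy data: the same heights in two different units.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_m = [h / 100 for h in height_cm]
weight_kg = [52.0, 58.0, 69.0, 77.0, 90.0]

r_cm = pearson_r(height_cm, weight_kg)
r_m = pearson_r(height_m, weight_kg)  # same value up to rounding: units cancel out

r_perfect = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])          # exact increasing line
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])         # exact decreasing line
r_parabola = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x^2, not linear
print(r_cm, r_m, r_perfect, r_negative, r_parabola)
```

The parabola case shows why r only detects linear relationships: y is fully determined by x, yet the coefficient is 0.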
Correlation versus causation
• Correlation does not imply causation
• a correlation can be a coincidence or be caused by a third variable
• explain and check causation in an experimental study
o vary a single variable while the others are kept equal
Supervised learning
• use labeled data to train the algorithm
• classification and regression problems
learning workflow
• 1) collect data
o consider reliability of measurement, privacy, and other regulations
o split data into training, validation, and test set with similar structure
▪ training set for learning
▪ validation set for tuning and setting hyperparameters
▪ test set for final evaluation
• 2) label examples (sometimes part of data collection)
o Annotation guidelines, Measure inter-annotator agreement, Crowdsourcing
• 3) choose representation (part of preprocessing)
o Features: attributes describing examples
o Observations: observed values for a given attribute
▪ numerical features: discrete or continuous
▪ categorical / nominal features, binary features
▪ ordinal features (scale)
o features can be converted to a vector
o ‘feature transformation’: e.g., use dummy coding to transform a categorical feature
to a numerical one
o ‘feature extraction’: select relevant features which represent the input and define
the output
• 4) train model(s)
o hyperparameters: settings for an algorithm decided by the programmer
▪ for each value of hyperparameter:
1) Apply algorithm to training set to learn
2) Check performance on validation set
3) Find/Choose best-performing setting
• 5) evaluate
o Check performance of tuned model on test set
o Goal: estimate how well your model will do in the real world (generalization)
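The split, tune, and evaluate steps above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the 60/20/20 split is one common choice, and a k-nearest-neighbours regressor with hyperparameter k stands in for "the algorithm" (the notes do not prescribe a model). All names (`knn_predict`, `mse`, `candidate_ks`) are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# 1) "Collect" data: hypothetical noisy samples of y = x^2 (already labeled).
data = []
for _ in range(200):
    x = random.uniform(-3, 3)
    data.append((x, x * x + random.gauss(0, 0.5)))

# Shuffle, then split 60/20/20 so all three sets have a similar structure.
random.shuffle(data)
train, val, test = data[:120], data[120:160], data[160:]


def knn_predict(train_set, x, k):
    """Predict y as the mean target of the k training points closest to x."""
    nearest = sorted(train_set, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train_set, eval_set, k):
    """Mean squared error of the k-NN predictions on eval_set."""
    errors = [(knn_predict(train_set, x, k) - y) ** 2 for x, y in eval_set]
    return sum(errors) / len(errors)


# 4) Train/tune: for each hyperparameter value, learn on the training set
#    and check performance on the validation set, then keep the best setting.
candidate_ks = [1, 3, 5, 10, 30, 120]
val_errors = {k: mse(train, val, k) for k in candidate_ks}
best_k = min(val_errors, key=val_errors.get)

# 5) Evaluate: use the test set once, only for the tuned model.
test_error = mse(train, test, best_k)
print(best_k, test_error)
```

The test set is touched only in the last step; using it during tuning would make the final error estimate too optimistic.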
regression task: predicting a numeric quantity
• regression analysis describes the relationship between random variables
• it can predict the value of one variable based on another variable and show trends
• output of regression problem is a function describing the relation between x and y
• unlike classification, regression makes numerical predictions (it predicts values for continuous variables)
linear regression
• simplest regression technique, with two types of variables: independent (predictor) and dependent (target)
• aim is to minimize the difference between the predicted and the actual values
• error measures (loss functions)
o sum of squared errors
o or other loss functions
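A minimal sketch of linear regression with one predictor, assuming the sum of squared errors as the loss and the standard closed-form least-squares solution. The function names `fit_line` and `sse` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sse(xs, ys, slope, intercept):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
best_sse = sse(xs, ys, slope, intercept)
other_sse = sse(xs, ys, slope + 0.1, intercept)  # any other line has a larger SSE
print(slope, intercept, best_sse, other_sse)
```

Swapping `sse` for a different loss (for example the sum of absolute errors) changes which line counts as "best" and generally no longer has this closed-form solution.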