Data Mining for Business and Governance (880022M6)
Institution: Tilburg University (UVT)
Week 1: Missing Values: Remove a feature if the majority of its instances are missing; removing instances is not ideal with limited data; replacing with the mean, median, or mode introduces noise; autoencoders are unsupervised neural nets with Encoder and Decoder components.

Normalization: transform values to $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.

Standardization: shift values by the mean and scale by the standard deviation: $z = \frac{x - \mu}{\sigma}$.
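A minimal sketch of both rescalings on a toy feature vector (the values are illustrative, not from the lecture):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Normalization: rescale to [0, 1] using the feature's min and max.
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: shift by the mean, scale by the standard deviation.
x_std = (x - x.mean()) / x.std()

print(x_norm)  # [0.   0.25 0.5  0.75 1.  ]
print(x_std)   # zero mean, unit variance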
Pearson's Correlation:
$$R = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$
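A minimal sketch of Pearson's correlation on two toy vectors (the values are illustrative, not from the lecture):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Numerator: covariance term; denominator: product of the spreads.
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
r = num / den

print(round(r, 4))              # manual formula
print(np.corrcoef(x, y)[0, 1])  # same value via numpy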
Chi-squared:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad \text{where } E_{ij} = \frac{\text{row}_i \times \text{col}_j}{\text{total}}.$$
Example: Consider the contingency table:

              Dogs   Cats   Rabbits
Shelter A        5      3         2
Shelter B        6      8         4

To calculate $\chi^2$: 1. Compute the row, column, and total sums. 2. Calculate the expected frequencies $E_{ij} = \frac{\text{row}_i \times \text{col}_j}{\text{total}}$. 3. Plug them into the chi-squared formula.

Using the formula, for dogs at Shelter A:
$$E_{11} = \frac{(5 + 3 + 2) \times (5 + 6)}{28} = \frac{10 \times 11}{28} \approx 3.93$$

Continue this process for all cells. The chi-squared statistic is:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Compute this term for every cell and sum them to get the final statistic.
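A minimal sketch of the full chi-squared computation for the shelter table above (rows: Shelter A/B; columns: Dogs, Cats, Rabbits):

import numpy as np

observed = np.array([[5, 3, 2],
                     [6, 8, 4]], dtype=float)

row_sums = observed.sum(axis=1, keepdims=True)   # row totals: 10 and 18
col_sums = observed.sum(axis=0, keepdims=True)   # column totals: 11, 11, 6
total = observed.sum()                           # 28

expected = row_sums @ col_sums / total           # E_ij = row_i * col_j / total
chi2 = ((observed - expected) ** 2 / expected).sum()

print(expected[0, 0])   # ~3.93 for dogs at Shelter A
print(round(chi2, 3))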
Encoding: Label Encoding for ordinal data, e.g., education level; One-hot Encoding for nominal data, e.g., cat/dog/rabbit. Undersampling of large datasets can cause data loss and lower accuracy; oversampling of small datasets can lead to overfitting.
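A minimal sketch of the two encodings; the specific category lists below are illustrative:

# Label encoding: ordinal categories map to ordered integers.
education_order = {"high school": 0, "bachelor": 1, "master": 2, "phd": 3}
education = ["bachelor", "phd", "high school"]
label_encoded = [education_order[level] for level in education]
print(label_encoded)  # [1, 3, 0]

# One-hot encoding: each nominal category becomes its own binary column.
animals = ["cat", "dog", "rabbit", "cat"]
categories = sorted(set(animals))                        # ['cat', 'dog', 'rabbit']
one_hot = [[int(a == c) for c in categories] for a in animals]
print(one_hot)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]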
Week 2 Pattern classification: Rule-based Learning: Classify with rules built from features and values. Decision trees are popular; the data is split on the attribute with the highest information gain (the largest reduction in entropy).

Entropy:
$$\text{Entropy}(P) = -\sum_i p_i \log p_i$$

Weighted Entropy:
$$\text{info}(x) = \sum_i \frac{x_i}{n_{\text{obs}}} \times \text{entropy}_i$$

Information Gain:
$$\text{gain}(f_i) = \text{info}(\text{root}) - \text{info}(f_i)$$

Example: Consider a dataset of 8 people of whom 3 get sunburned (Entropy = 0.9544). For hair colour: blond covers 4 people, split 2 sunburned and 2 not (Entropy = 1); brown covers 3 people, none sunburned (Entropy = 0); red covers 1 person, not sunburned (Entropy = 0). Compute the Information Gain for the attribute "Hair" using:
$$IG(\text{Hair}) = 0.9544 - \left(\frac{4}{8} \times 1 + \frac{3}{8} \times 0 + \frac{1}{8} \times 0\right)$$
Solution: $IG(\text{Hair}) = 0.4544$.
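A minimal sketch reproducing the information-gain computation for the "Hair" attribute in the sunburn example above:

import math

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

root = entropy([3, 5])                         # ~0.9544 (3 sunburned, 5 not)

# Subsets by hair colour: [sunburned, not sunburned] counts.
subsets = {"blond": [2, 2], "brown": [0, 3], "red": [0, 1]}
n = 8
weighted = sum(sum(c) / n * entropy(c) for c in subsets.values())

ig_hair = root - weighted
print(round(ig_hair, 4))                       # ~0.4544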
Bayesian Learning: Classify based on feature frequencies, assuming independence.

Naive Bayes: Naïve Bayes does not impose any restrictions on the number of decision classes, although the example in the lecture referred to a binary classification problem. This supervised learning method assumes that features have the same importance and are independent.
$$P(H|E) = \frac{P(H) \times P(E|H)}{P(E)}$$
Example: $\Pr[E_1|H]$ might be $\Pr[\text{animal}=\text{cat} \mid \text{pet}=\text{yes}]$. Given a dataset with the following entries for playing outside based on weather outlook and temperature:

Outlook     Temperature   Play
Sunny       Hot           Yes
Overcast    Mild          Yes
Rainy       Cool          No
Sunny       Mild          Yes
Sunny       Cool          Yes
Rainy       Mild          No

From the data, we derive the probabilities: $P(\text{Play=Yes}) = \frac{4}{6}$, $P(\text{Play=No}) = \frac{2}{6}$, $P(\text{Outlook=Sunny} \mid \text{Play=Yes}) = \frac{3}{4}$, $P(\text{Temperature=Cool} \mid \text{Play=Yes}) = \frac{1}{4}$.

Using the Naïve Bayes formula, the (unnormalized) score that someone will play outside when the outlook is sunny and the temperature is cool is:
$$P(\text{Play=Yes} \mid \text{Sunny}, \text{Cool}) \propto P(\text{Sunny} \mid \text{Yes}) \times P(\text{Cool} \mid \text{Yes}) \times P(\text{Yes}) = \frac{3}{4} \times \frac{1}{4} \times \frac{4}{6} = \frac{1}{8}$$
Then do the same for $P(\text{Play=No} \mid \text{Sunny}, \text{Cool})$ and normalize so the two posteriors sum to 1.
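A minimal sketch of this calculation, including the normalization step; the Play=No likelihoods are read off the same six-row table above:

p_yes = 4 / 6
p_no = 2 / 6

# Likelihoods counted within each class from the table.
p_sunny_given_yes = 3 / 4
p_cool_given_yes = 1 / 4
p_sunny_given_no = 0 / 2      # no "Sunny" row has Play=No
p_cool_given_no = 1 / 2       # one "Cool" row has Play=No

score_yes = p_sunny_given_yes * p_cool_given_yes * p_yes   # = 1/8
score_no = p_sunny_given_no * p_cool_given_no * p_no

# Normalize so the two posteriors sum to 1.
posterior_yes = score_yes / (score_yes + score_no)
posterior_no = score_no / (score_yes + score_no)
print(posterior_yes, posterior_no)   # 1.0 and 0.0 for this toy table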
Lazy Learning: The similarity function is crucial. KNN uses the K nearest neighbours and a distance function; it is sensitive to outliers and to the choice of distance function.

Random Forest: Aggregates decision trees' outputs through a majority vote; uses bagging. More trees may overfit. Shallow decision trees might lead to overly simple models unable to fit the data. A model that underfits has high training and high test errors; hence, poor performance on both the training and test sets indicates underfitting, which means the hypotheses are not complex enough to include the true but unknown prediction function. The shallower the tree, the less variance we have in our predictions. However, at some point we start to inject too much bias, as shallow trees (e.g., stumps) cannot capture interactions and complex patterns present in the data. Bagging generates additional training datasets from the original dataset using random sampling with replacement; every element is equally likely to be selected. These datasets are used to train multiple models in parallel, the predictions of the ensemble members are aggregated, and the decision class resulting from a majority vote is assigned to the instance. Overall, bagging decreases the variance.

Boosting: Prioritize misclassified instances by adjusting their weights and retrain the model until a stopping criterion is met. Boosting is a sequential ensemble method that iteratively adjusts the weight of each observation according to the last classification: if an instance is incorrectly classified, its weight increases. The term 'boosting' refers to methods that convert a weak learner into a stronger one. It usually decreases the bias error and builds strong predictive models.
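A minimal sketch of the bagging mechanics described above: bootstrap samples drawn with replacement train several models in parallel, and the final class comes from a majority vote. The 1-nearest-neighbour "model" and the toy 1-D data are illustrative stand-ins, not from the lecture:

import random
from collections import Counter

random.seed(0)
data = [(1.0, "no"), (2.0, "no"), (3.0, "no"),
        (6.0, "yes"), (7.0, "yes"), (8.0, "yes")]

def predict_1nn(sample, x):
    """Predict the label of the training point closest to x."""
    return min(sample, key=lambda point: abs(point[0] - x))[1]

def bagged_predict(data, x, n_models=15):
    votes = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [random.choice(data) for _ in data]
        votes.append(predict_1nn(sample, x))
    # Majority vote over the ensemble members' predictions.
    return Counter(votes).most_common(1)[0][0]

print(bagged_predict(data, 2.5))   # expected to vote "no"
print(bagged_predict(data, 6.5))   # expected to vote "yes"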
Distance Functions:
$$\text{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}, \quad \text{Manhattan}(x, y) = \sum_{i=1}^{d} |x_i - y_i|$$
$$\text{Hamming}(x, y) = \sum_{i=1}^{d} I(x_i \neq y_i), \quad \text{Minkowski}(x, y, p) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{\frac{1}{p}}$$
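A minimal sketch of the four distance functions on two illustrative vectors:

import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def hamming(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

def minkowski(x, y, p):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(a, b))        # ~3.742
print(manhattan(a, b))        # 6.0
print(hamming(a, b))          # 3
print(minkowski(a, b, 2))     # equals the Euclidean distance for p = 2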
Week 3 Evaluation and model selection: Generalization Capability: the model's performance on unseen data. Hyperparameters regulate model construction. Performance measures assess the model's predictive capabilities.

Data Splitting: Hold-out splits the dataset into training, validation, and test sets. Risks: improper representation, unrealistically high results. Stratification matches the sample's class distribution to that of the entire population.
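A minimal sketch of a stratified hold-out split, assuming scikit-learn is available; the iris data and the 80/20 split are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class distribution of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(len(X_train), len(X_test))  # 120 and 30 instances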
K-fold Cross-validation averages results over k iterations, using k−1 folds for training in each iteration.

Nested K-fold CV: The inner CV is used to tune the hyperparameters (validation) and obtain the best model for the outer CV.

Hyperparameter Tuning: Grid Search tries all combinations; Random Search tries a subset of combinations.
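A minimal sketch of grid search combined with 5-fold cross-validation, assuming scikit-learn is available; the kNN model, the grid of k values, and the iris data are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}      # candidate k values
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)                                   # runs 5-fold CV for every k in the grid

print(search.best_params_)                         # k with the best mean CV accuracy
print(round(search.best_score_, 3))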
Confusion Matrix:

             Predict Cat   Predict Dog
True Cat          TP            FN
True Dog          FP            TN

$$\text{Accuracy} = \frac{TP + TN}{\text{Total}}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad F_\beta = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$$
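A minimal sketch computing these measures from hypothetical confusion-matrix counts (the counts below are illustrative, not from the lecture):

tp, fn, fp, tn = 40, 10, 5, 45   # hypothetical counts for the cat/dog table

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives the usual F1 measure."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

print(accuracy, recall, precision, f_beta(precision, recall, beta=1.0))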
Bias-Variance Trade-off: Bias (fit): low bias fits the data well; high bias → the model is too simple for the data (underfitting). Variance (consistency from training to test): high variance → the model fits the random noise in the training data (overfitting).

Overfitting Solutions: Pruning in trees: simplify by removing less impactful subtrees. Higher k in kNN: a larger k reduces the overfitting risk.