Data Mining for Business and Governance (880022M6)
Institution: Tilburg University (UVT)
Week 1: Missing Values: Remove features if the majority of instances are missing; removing instances is not ideal with limited data; replacing with the mean, median, or mode introduces noise. Autoencoders are unsupervised neural nets with Encoder and Decoder components.

Normalization: Transform values to $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.

Standardization: Shift values by the mean and scale by the standard deviation: $z = \frac{x - \mu}{\sigma}$.
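A minimal NumPy sketch of both rescalings; the feature values are made up for illustration:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # made-up feature values

# Min-max normalization: maps the feature onto [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: mean 0, (population) standard deviation 1
z = (x - x.mean()) / x.std()

print(x_norm)             # [0.  0.333...  0.666...  1.]
print(z.mean(), z.std())  # ~0.0 and 1.0
```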
Pearson's Correlation:

$$R = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
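The formula translates directly into NumPy and can be checked against the built-in; the paired values below are invented:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])  # invented paired measurements

# Direct translation of the formula above
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

print(r)                        # ~0.853
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy's built-in
```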
Chi-squared:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{\mathrm{row}_i \times \mathrm{col}_j}{\mathrm{total}}$$

Example: Consider the contingency table:

            Dogs   Cats   Rabbits
Shelter A     5      3       2
Shelter B     6      8       4

To calculate $\chi^2$: 1. Compute the row, column, and total sums. 2. Calculate the expected frequencies $E_{ij}$. 3. Plug into the chi-squared formula.

Using the formula for dogs at Shelter A, with row sum 5 + 3 + 2 = 10, column sum 5 + 6 = 11, and total 28:

$$E_{11} = \frac{10 \times 11}{28} \approx 3.93$$

Continue this process for all cells, then sum $(O_{ij} - E_{ij})^2 / E_{ij}$ over all animals and shelters to get the final statistic.
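The whole computation for the shelter table can be scripted; a sketch using NumPy, with scipy.stats.chi2_contingency as a cross-check (for a table with more than one degree of freedom SciPy applies no continuity correction, so it matches the manual statistic):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the shelter table
O = np.array([[5, 3, 2],
              [6, 8, 4]])

# Expected counts: E_ij = row_i * col_j / total
E = O.sum(axis=1, keepdims=True) * O.sum(axis=0, keepdims=True) / O.sum()

chi2_manual = ((O - E) ** 2 / E).sum()
print(E[0, 0])      # 3.9285..., the E11 worked out above
print(chi2_manual)

# Cross-check with SciPy
chi2_scipy, p, dof, expected = chi2_contingency(O)
print(chi2_scipy)   # matches the manual statistic
```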
Encoding: Label Encoding for ordinal data, e.g., education level; One-hot Encoding for nominal data, e.g., cat/dog/rabbit. Undersampling large datasets can cause data loss and lower accuracy; oversampling small datasets can lead to overfitting.
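A small pandas sketch of both encodings on a toy frame; the column names and category order are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"education": ["low", "high", "medium", "low"],
                   "animal": ["cat", "dog", "rabbit", "cat"]})

# Label encoding for the ordinal feature: the mapping preserves the order
df["education"] = df["education"].map({"low": 0, "medium": 1, "high": 2})

# One-hot encoding for the nominal feature: one binary column per category
df = pd.get_dummies(df, columns=["animal"])
print(df)
```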
Week 2 Pattern classification: Rule-based Learning: Classify with rules built from features and their values. Decision trees are popular; the data is split on the attribute with the highest information gain, i.e., the largest reduction in entropy.

Entropy:

$$\mathrm{Entropy}(P) = -\sum_i p_i \log_2 p_i$$

Weighted Entropy:

$$\mathrm{info}(x) = \sum_i \frac{x_i}{n_{\mathrm{obs}}} \times \mathrm{entropy}_i$$

Information Gain:

$$\mathrm{gain}(f_i) = \mathrm{info}(\mathrm{root}) - \mathrm{info}(f_i)$$

Example: Consider a dataset of 8 people, 3 of whom get sunburned (Entropy = 0.9544). For hair color: blond covers 4 people with a split of 2 sunburned, 2 not (Entropy = 1); brown covers 3 people, none sunburned (Entropy = 0); red covers 1 person, not sunburned (Entropy = 0). The information gain for the attribute "Hair" is:

$$IG(\mathrm{Hair}) = 0.9544 - \left(\frac{4}{8} \times 1 + \frac{3}{8} \times 0 + \frac{1}{8} \times 0\right) = 0.4544$$
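A short script reproducing the sunburn numbers; entropy here is a hypothetical helper, not lecture code:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

root = entropy([3, 5])  # 3 sunburned, 5 not -> 0.9544
# Hair-color branches: blond (2 yes, 2 no), brown (0, 3), red (0, 1)
info_hair = 4/8 * entropy([2, 2]) + 3/8 * entropy([0, 3]) + 1/8 * entropy([0, 1])

print(root)              # ~0.9544
print(root - info_hair)  # IG(Hair) ~ 0.4544
```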
Bayesian Learning: Classify based on feature frequencies, assuming independence.

Naive Bayes: Naïve Bayes does not impose any restrictions on the number of decision classes, although the example in the lecture referred to a binary classification problem. This supervised learning method assumes that features have the same importance and are independent.

$$P(H|E) = \frac{P(H) \times P(E|H)}{P(E)}$$

Example: $Pr[E_1|H]$ might be $Pr[\mathrm{animal} = \mathrm{cat} \mid \mathrm{pet} = \mathrm{yes}]$. Given a dataset with the following entries for playing outside based on weather outlook and temperature:

Outlook    Temperature   Play
Sunny      Hot           Yes
Overcast   Mild          Yes
Rainy      Cool          No
Sunny      Mild          Yes
Sunny      Cool          Yes
Rainy      Mild          No

From the data we derive the probabilities: P(Play=Yes) = 4/6, P(Play=No) = 2/6, P(Outlook=Sunny|Play=Yes) = 3/4, P(Temperature=Cool|Play=Yes) = 1/4.

Using the Naïve Bayes formula, the likelihood that someone plays outside when the outlook is sunny and the temperature is cool is P(Play=Yes|Outlook=Sunny, Temperature=Cool) ∝ P(Outlook=Sunny|Play=Yes) × P(Temperature=Cool|Play=Yes) × P(Play=Yes) = 3/4 × 1/4 × 4/6 = 1/8. Then do the same for P(Play=No|Outlook=Sunny, Temperature=Cool) and normalize to get a result between 0 and 1.
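The same arithmetic hand-rolled in Python. Note that in this tiny table "Sunny" never co-occurs with Play=No, so the No branch scores zero; real implementations usually apply Laplace smoothing to avoid such zeros:

```python
# Priors and conditionals read directly off the six-row table above
p_yes, p_no = 4 / 6, 2 / 6
p_sunny_yes, p_cool_yes = 3 / 4, 1 / 4
p_sunny_no, p_cool_no = 0 / 2, 1 / 2  # "Sunny" never occurs with Play=No

# Unnormalized scores, proportional to P(class | Sunny, Cool)
score_yes = p_sunny_yes * p_cool_yes * p_yes  # 3/4 * 1/4 * 4/6 = 1/8
score_no = p_sunny_no * p_cool_no * p_no      # 0 * 1/2 * 2/6 = 0

# Normalize so the two posteriors sum to 1
print(score_yes / (score_yes + score_no))  # 1.0 for this tiny table
print(score_no / (score_yes + score_no))   # 0.0
```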
Lazy Learning: The similarity function is crucial. KNN classifies with the K nearest neighbors under a distance function. Sensitive to outliers and to the choice of distance function.

Distance Functions:

$$\mathrm{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}, \qquad \mathrm{Manhattan}(x, y) = \sum_{i=1}^{d} |x_i - y_i|$$

$$\mathrm{Hamming}(x, y) = \sum_{i=1}^{d} I(x_i \neq y_i), \qquad \mathrm{Minkowski}(x, y, p) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$$
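All four distances as NumPy one-liners; the vectors are arbitrary examples. Minkowski generalizes the others: p = 1 gives Manhattan and p = 2 gives Euclidean:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def hamming(x, y):
    return np.sum(x != y)  # number of positions where the vectors differ

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(x, y), minkowski(x, y, 2))  # equal: Minkowski with p = 2
print(manhattan(x, y), minkowski(x, y, 1))  # equal: Minkowski with p = 1
print(hamming(x, y))                        # 2 mismatching coordinates
```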
Random Forest: Aggregates the outputs of decision trees through a majority vote and uses bagging; more trees may overfit. Shallow decision trees might lead to overly simple models unable to fit the data. A model that underfits has high training and high test errors; hence, poor performance on both the training and test sets indicates underfitting, meaning the hypotheses are not complex enough to include the true but unknown prediction function. The shallower the tree, the less variance we have in our predictions; however, at some point we start to inject too much bias, as shallow trees (e.g., stumps) cannot capture the interactions and complex patterns present in the data.
Bagging: Generates additional training datasets from the original dataset using random sampling with replacement, so every element is equally probable to be selected. These datasets are used to train multiple models in parallel, and the predictions of the different ensemble models are averaged; the decision class resulting from a majority vote is assigned to the instance. Overall, bagging decreases the variance.

Boosting: Prioritizes misclassified instances by weight adjustment and retrains the model until a criterion is met. Boosting is a sequential ensemble method that iteratively adjusts the weight of each observation according to the last classification: if an instance is incorrectly classified, its weight increases. The term "boosting" refers to methods that convert a weak learner into a stronger one. It usually decreases the bias error and builds strong predictive models.

Week 3 Evaluation and model selection: Generalization Capability: the model's performance on unseen data. Hyperparameters regulate model construction. Performance measures assess the model's predictive capabilities.

Data Splitting: Hold-out splits the dataset into training, validation, and test sets. Risks: improper representation, unrealistically high results. Stratification matches the sample's class distribution to that of the entire population. K-fold Cross-validation averages results over k iterations, using k-1 folds for training. Nested K-fold CV: an inner CV is used to tune hyperparameters (validation) to get the best model for the outer CV.

Hyperparameter Tuning: Grid Search tries all combinations; Random Search tries a subset of combinations.
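A compact scikit-learn sketch of nested CV: a grid search (the inner CV, trying every combination of the listed hyperparameters) wrapped in an outer cross-validation; iris is just a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Inner CV (validation): grid search tries every listed combination
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7]},
                     cv=3)

# Outer CV: estimates how the tune-then-fit procedure generalizes
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean())  # average over the 5 outer folds
```

Because the outer folds never see the data used for tuning, the outer score is an honest generalization estimate rather than an optimistically tuned one.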
Confusion Matrix:

            Predict Cat   Predict Dog
True Cat        TP            FN
True Dog        FP            TN

$$\mathrm{Accuracy} = \frac{TP + TN}{\mathrm{Total}}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad F_\beta = \frac{(1 + \beta^2) \times \mathrm{precision} \times \mathrm{recall}}{\beta^2 \times \mathrm{precision} + \mathrm{recall}}$$
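The metrics computed directly from the four cells; the counts are invented for illustration:

```python
# Invented counts for the four cells of the cat/dog matrix above
TP, FN, FP, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)
precision = TP / (TP + FP)

beta = 1  # beta = 1 gives the familiar F1 score
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(accuracy, precision, recall, f_beta)  # 0.85 0.888... 0.8 0.842...
```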
Bias-Variance Trade-off: Bias measures fit: low bias fits the data well; high bias means the model is too simple for the data (underfitting). Variance measures consistency from training to test: high variance means the model fits the random noise in the training data (overfitting).

Overfitting Solutions: Pruning in trees: simplify by removing less impactful subtrees. Higher k in kNN: a larger k reduces the risk of overfitting.