Summary Cheat Sheet including all equations and examples 2023/2024

Name: Cheat Sheet including all equations and examples 2023/2024
SKU: doc_3630866
Rating: 5.00 (1 reviews)
Author: giyantogoossens

154 views, sold 5 times

Course
Data Mining for Business and Governance (880022M6)

Institution
Tilburg University (UVT)

Maximize your chances of acing the exam with this meticulously crafted cheat sheet. Designed specifically for exam success, it distills complex data mining concepts into easy-to-understand formulas and techniques, including the XAI lecture. Tailored to fit exam guidelines, this sheet is your allowe...

Preview: 1 of 2 pages

Uploaded on October 16, 2023
Number of pages: 2
Written in 2023/2024
Type: Summary

decision trees
entropy calculations
naive bayes classifier
conditional probability
chi squared
gini impurity
information gain
feature selection
explainable artificial intelligence

Study
Data science and Society

1 review

By: tygovandenherik1 • 1 month ago

giyantogoossens

Member for 1 year, 7 documents sold

3,99 €

Week 1:

Missing Values: Remove features if the majority of instances are missing; removing instances is not ideal with limited data; replacing with the mean, median, or mode introduces noise; autoencoders are unsupervised neural nets with encoder and decoder components.

Normalization: Transform values to $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.

Standardization: Shift values by the mean and scale by the standard deviation: $z = \frac{x - \mu}{\sigma}$.

Pearson's Correlation:

$$R = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

Chi-squared:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad \text{where } E_{ij} = \frac{row_i \times col_j}{total}$$

Example: Consider the contingency table:

            Dogs  Cats  Rabbits
Shelter A   5     3     2
Shelter B   6     8     4

To calculate $\chi^2$: 1. Compute row, column and total sums. 2. Calculate expected frequencies: $E_{ij} = \frac{row_i \times col_j}{total}$. 3. Plug into the chi-squared formula.

Using the formula, for dogs at Shelter A (column sum times row sum, divided by the total):

$$E_{11} = \frac{(5 + 6) \times (5 + 3 + 2)}{28} = \frac{11 \times 10}{28} \approx 3.93$$

Continue this process for all cells. The chi-squared statistic is:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Compute the term for every cell (all animals at both shelters) and sum them to get the final statistic.

Encoding: Label Encoding for ordinal data, e.g., education level; One-hot Encoding for nominal data, e.g., cat/dog/rabbit.

Undersampling for large datasets can cause data loss and lower accuracy; Oversampling can lead to overfitting with small datasets.

Week 2 Pattern classification:

Rule-based Learning: Classify with rules from features and values. Decision trees are popular. Data is split on the attribute with the highest information gain, i.e. the one whose split leaves the lowest weighted entropy.

Entropy:

$$Entropy(P) = -\sum_i p_i \log_2 p_i$$

Weighted Entropy:

$$info(x) = \sum_i \frac{x_i}{n_{obs}} \times entropy_i$$

Information Gain:

$$gain(f_i) = info(root) - info(f_i)$$

Example: Consider a dataset with sunburned people: 3 out of 8 get sunburned (Entropy = 0.9544). For hair color, blond has 4 people with a split of 2 sunburned, 2 not (Entropy = 1); brown has 3 people, none sunburned (Entropy = 0); red-haired is 1 person, sunburned (Entropy = 0). Compute the Information Gain for the attribute "Hair" using:

$$IG(Hair) = 0.9544 - \left(\frac{4}{8} \times 1 + \frac{3}{8} \times 0 + \frac{1}{8} \times 0\right)$$

Solution: IG(Hair) = 0.4544

Bayesian Learning: Classify based on feature frequencies, assuming independence.

Naive Bayes: Naïve Bayes does not impose any restrictions on the number of decision classes to be produced, although the example in the lecture referred to a binary classification problem. This supervised learning method assumes that features have the same importance and are independent.

$$P(H|E) = \frac{P(H) \times P(E|H)}{P(E)}$$

Example: $Pr[E_1|H]$ might be $Pr[animal = cat \mid pet = yes]$. Given a dataset with the following entries for playing outside based on weather outlook and temperature:

Outlook    Temperature  Play
Sunny      Hot          Yes
Overcast   Mild         Yes
Rainy      Cool         No
Sunny      Mild         Yes
Sunny      Cool         Yes
Rainy      Mild         No

From the data, we derive the probabilities: P(Play=Yes) = 4/6, P(Play=No) = 2/6, P(Outlook=Sunny|Play=Yes) = 3/4, P(Temperature=Cool|Play=Yes) = 1/4.

Using the Naïve Bayes formula for the likelihood someone will play outside when the outlook is sunny and the temperature is cool (before normalization):

$$P(Play=Yes \mid Outlook=Sunny, Temperature=Cool) \propto P(Outlook=Sunny \mid Play=Yes) \times P(Temperature=Cool \mid Play=Yes) \times P(Play=Yes)$$

Substituting in the values:

$$\frac{3}{4} \times \frac{1}{4} \times \frac{4}{6} = \frac{1}{8}$$

Then do the same for P(Play=No|Outlook=Sunny, Temperature=Cool) and normalize to get a result between 0 and 1.

Lazy Learning: Similarity function crucial. KNN uses K-nearest neighbors and a distance function. Sensitive to outliers and distance function selection.

Random Forest: Aggregates decision trees' outputs through majority vote. More trees may overfit. Uses bagging. Shallow decision trees might lead to overly simple models unable to fit the data. A model that underfits will have high training and high test errors. Hence, poor performance on training and test sets indicates underfitting, which means the hypotheses are not complex enough to include the true but unknown prediction function. The shallower the tree, the less variance we have in our predictions. However, we can start to inject too much bias at some point as shallow trees (e.g., stumps) cannot capture interactions and complex patterns present in our data.

Bagging: Generates additional data for training from the dataset (using random sampling with replacement). Every element is equally probable to be selected. Such datasets are used to train multiple models in parallel. The average of all the predictions from different ensemble models is calculated. The decision class resulting from a majority vote is assigned to the instance. Overall, bagging decreases the variance.

Boosting: Prioritize misclassified instances by weight adjustment. Retrain the model until a criterion is met. Boosting is a sequential ensemble method that iteratively adjusts the weight of each observation according to the last classification. If an instance is incorrectly classified, its weight is increased. The term 'boosting' refers to methods that convert a weak learner to a stronger one. It usually decreases the bias error and builds strong predictive models.

Distance Functions:

$$Euclidean(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}, \quad Manhattan(x, y) = \sum_{i=1}^{d} |x_i - y_i|$$

$$Hamming(x, y) = \sum_{i=1}^{d} I(x_i \neq y_i), \quad Minkowski(x, y, p) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{\frac{1}{p}}$$

Week 3 Evaluation and model selection:

Generalization Capability: Model's performance on unseen data. Hyperparameters regulate model construction.

Performance Measures assess the model's predictive capabilities.

Data Splitting: Hold-out splits the dataset into training, validation, and test. Risks: Improper representation, unrealistically high results.

Stratification matches the sample's class distribution to the entire population.

K-fold Cross-validation averages results over k iterations, each using k-1 folds for training and the remaining fold for testing.

Nested K-fold CV: Inner CV used to tune hyperparameters (validation) to get the best model for the outer CV.

Hyperparameter Tuning: Grid Search: All combinations. Random Search: Subset of combinations.

Confusion Matrix:

            Predict Cat  Predict Dog
True Cat    TP           FN
True Dog    FP           TN

$$Accuracy = \frac{TP + TN}{Total}, \quad Recall = \frac{TP}{TP + FN}$$

$$Precision = \frac{TP}{TP + FP}, \quad F_\beta = \frac{(1 + \beta^2) \times precision \times recall}{\beta^2 \times precision + recall}$$

Bias-Variance Trade-off: Bias (fit): low bias fits the data well; high bias → model is too simple for the data (underfitting). Variance (consistency from training to test): high variance → model fits the random noise in the training data (overfitting).

Overfitting Solutions: Pruning in Trees: Simplify by removing less impactful subtrees. Higher k in kNN: Larger k reduces overfitting risk.
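
The three chi-squared steps from Week 1 (sums, expected frequencies, statistic) can be checked with a short Python sketch on the shelter table. The function names are illustrative choices, not something from the lecture.

```python
# Observed counts from the shelter example
# (rows: Shelter A, Shelter B; columns: Dogs, Cats, Rabbits).
observed = [
    [5, 3, 2],
    [6, 8, 4],
]

def expected_frequencies(table):
    """Return E_ij = row_i * col_j / total for every cell."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

def chi_squared(table):
    """Sum (O_ij - E_ij)^2 / E_ij over all cells of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

expected = expected_frequencies(observed)
chi2 = chi_squared(observed)
print(round(expected[0][0], 2))  # dogs at Shelter A: 3.93
print(round(chi2, 4))
```

The first printed value reproduces the expected count for dogs at Shelter A, which uses the full row sum of 10 (5 + 3 + 2).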
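
The information gain example for the attribute "Hair" can be reproduced in a few lines of Python. This is a minimal sketch that assumes base-2 logarithms (which matches the stated entropy of 0.9544) and assumes the single red-haired person is sunburned, so that the branch counts add up to 3 sunburned out of 8.

```python
from math import log2

def entropy(counts):
    """Base-2 entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(root_counts, branch_counts):
    """gain = info(root) - weighted entropy of the branches."""
    n = sum(root_counts)
    weighted = sum(sum(b) / n * entropy(b) for b in branch_counts)
    return entropy(root_counts) - weighted

# Counts are [sunburned, not sunburned].
root = [3, 5]
hair = {"blond": [2, 2], "brown": [0, 3], "red": [1, 0]}  # red: assumed sunburned

ig_hair = information_gain(root, list(hair.values()))
print(round(entropy(root), 4))  # 0.9544
print(round(ig_hair, 4))        # 0.4544
```

Only the blond branch is impure, so the weighted entropy is 4/8 × 1 = 0.5.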
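
The Naïve Bayes weather example can be worked through with exact fractions. The sketch below counts frequencies directly from the six rows of the play/outlook/temperature table; the helper name `score` is an illustrative choice.

```python
from fractions import Fraction

# (Outlook, Temperature, Play) rows from the example table.
rows = [
    ("Sunny", "Hot", "Yes"),
    ("Overcast", "Mild", "Yes"),
    ("Rainy", "Cool", "No"),
    ("Sunny", "Mild", "Yes"),
    ("Sunny", "Cool", "Yes"),
    ("Rainy", "Mild", "No"),
]

def score(outlook, temperature, label):
    """Unnormalized score: P(outlook|label) * P(temperature|label) * P(label)."""
    in_class = [r for r in rows if r[2] == label]
    prior = Fraction(len(in_class), len(rows))
    p_outlook = Fraction(sum(r[0] == outlook for r in in_class), len(in_class))
    p_temp = Fraction(sum(r[1] == temperature for r in in_class), len(in_class))
    return p_outlook * p_temp * prior

score_yes = score("Sunny", "Cool", "Yes")
score_no = score("Sunny", "Cool", "No")

# Normalize so the two class scores sum to 1.
posterior_yes = score_yes / (score_yes + score_no)
print(score_yes)      # 1/8
print(posterior_yes)  # 1
```

Because no "No" row has a sunny outlook, the score for Play=No is 0 here, so the normalized probability of Play=Yes comes out as 1 on this small dataset.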
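
The four distance functions can be written directly from their definitions. This is a plain-Python sketch for equal-length sequences; the sample inputs in the usage lines are made up for illustration.

```python
def minkowski(x, y, p):
    """(sum |x_i - y_i|^p)^(1/p); p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    """Square root of the summed squared differences."""
    return minkowski(x, y, 2)

def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions where the two sequences differ."""
    return sum(a != b for a, b in zip(x, y))

print(euclidean([0, 0], [3, 4]))      # 5.0
print(manhattan([0, 0], [3, 4]))      # 7
print(hamming("karolin", "kathrin"))  # 3
```

Hamming distance suits categorical features, while the other three assume numeric features.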
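
The confusion-matrix measures from Week 3 translate directly into code. The sketch below uses hypothetical counts (TP=40, FN=10, FP=5, TN=45) that are not from the cheat sheet, and treats "Cat" as the positive class as in the matrix layout.

```python
def classification_metrics(tp, fn, fp, tn, beta=1.0):
    """Accuracy, recall, precision and F-beta from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_beta": f_beta,
    }

# Hypothetical counts, chosen only to illustrate the formulas.
m = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With beta=1 the F-measure weights precision and recall equally; a beta above 1 gives recall more weight.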
