Summary Cheatsheet for Data Mining for Business and Governance Exam (2 pages)

Prepare effectively for your Data Mining for Business and Governance exam with this concise and structured cheatsheet. Spanning 2 pages, this resource is tailored for exam success, offering a quick reference guide organized by lecture topics. Featuring LaTeX-rendered mathematical formulas fo...

§ Basic Positively (right) skewed: mean > median > mode. Chebyshev's rule: at least $(1 - \frac{1}{k^2})$ of observations lie no more than $k$ standard deviations from the mean. Label encoding: assign an integer to each category; it only makes sense if there is an ordinal relationship among the categories. One-hot encoding: encodes nominal features that lack an ordinal relationship; increases the problem dimensionality. Class imbalance: oversampling; undersampling; SMOTE (might induce noise).
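
SMOTE has no code in the original; a minimal sketch, assuming the imbalanced-learn package (not part of the cheatsheet):

# Minimal SMOTE sketch (assumes the imbalanced-learn package is installed).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced dataset (hypothetical; any X, y with a minority class works).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class points by interpolating between
# a minority instance and one of its nearest minority-class neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))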
$\mathrm{Var}(x) = E(x^2) - E(x)^2$. Pearson correlation: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. $\chi^2$ association measure: $\chi^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, with $E_{ij} = \frac{p_i \times p_j}{k}$; $O_{ij}$: observed together; $E_{ij}$: expected value.
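
To check these formulas numerically, a short sketch with scipy (my addition; scipy.stats provides both):

# Verify the correlation and chi-square formulas numerically.
import numpy as np
from scipy.stats import pearsonr, chi2_contingency

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson r via the definition above, then via scipy.
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
r_scipy, _ = pearsonr(x, y)
print(r_manual, r_scipy)  # should match

# Chi-square association on a contingency table of observed counts O_ij;
# chi2_contingency derives the expected counts E_ij from the marginals.
observed = np.array([[30, 10], [20, 40]])
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)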
Drop sparse columns (keep only columns with at least 90% non-missing values): df.dropna(thresh=0.9*len(df), axis=1, inplace=True). Mean imputation: df['f'].fillna(mean_v, inplace=True). Normalization: from sklearn.preprocessing import MinMaxScaler; scaler = MinMaxScaler(); df['f'] = scaler.fit_transform(df[['f']]). Standardization: scaler = StandardScaler(). Label encoding: encoder = LabelEncoder(); df['sex'] = encoder.fit_transform(df['sex']), or on a standalone series: label_encoder = LabelEncoder(); encoded_data = label_encoder.fit_transform(Cancer_risk).
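
One-hot encoding is described above without code; a minimal sketch using pandas.get_dummies (the 'city' column is hypothetical):

# One-hot encode a nominal feature.
import pandas as pd

df = pd.DataFrame({"city": ["Tilburg", "Utrecht", "Tilburg", "Delft"]})
# get_dummies adds one 0/1 column per category, increasing dimensionality
# as noted above; drop_first=True avoids one redundant column.
df_encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(df_encoded)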
§ Classification Algorithms Rule-based learning: Decision tree. Internal node: test on an attribute; branch: outcome of the test; leaf/terminal node: class label; root node: topmost. Entropy: $\mathrm{entropy}(P) = -\sum_i p_i \log_2 p_i$, a measure of disorder ($0 \rightarrow$ pure). Information value: weighted entropy. Information gain: $\mathrm{gain}(f_i) = \mathrm{info}(\mathrm{root}) - \mathrm{info}(f_i)$.
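
A quick numeric check of entropy and information gain (my addition; the helper and variable names are illustrative):

# Entropy and information gain on a toy split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

y_root = np.array([1, 1, 1, 0, 0, 0, 0, 0])                 # parent node
splits = [np.array([1, 1, 1, 0]), np.array([0, 0, 0, 0])]   # children after a test

# info(f) = weighted entropy of the children (the "information value" above).
weights = [len(s) / len(y_root) for s in splits]
info_f = sum(w * entropy(s) for w, s in zip(weights, splits))
gain = entropy(y_root) - info_f
print(round(gain, 3))  # gain(f) = info(root) - info(f)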
Bayesian learning: assumes features are independent. Bayes' theorem: $P(C_i \mid X) = \frac{P(X \mid C_i) \cdot P(C_i)}{P(X)}$. Naïve Bayes: $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \cdot P(x_2 \mid C_i) \cdots P(x_n \mid C_i)$. Normalization: $P(C_1 \mid X) / (P(C_1 \mid X) + P(C_2 \mid X))$. The assumptions of independence and equal importance of features are rarely fulfilled.
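
A minimal Naïve Bayes sketch with scikit-learn's GaussianNB (my addition; the cheatsheet only gives the formulas):

# Gaussian Naive Bayes: class-conditional independence in practice.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

model = GaussianNB().fit(X_tr, y_tr)
# predict_proba returns the normalized posteriors P(C_i | X).
print(model.predict_proba(X_te[:3]).round(3))
print("accuracy:", model.score(X_te, y_te))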
Lazy learning: similar instances should lead to the same decision classes. KNN: works well when the classes are clearly separated; use an odd $k$; sensitive to outliers, the number of neighbors and the distance function. Minkowski distance: $p=1$: Manhattan, $p=2$: Euclidean; Chebyshev: max difference; cosine similarity $= \cos(\theta)$; cosine distance $= 1 - \cos(\theta)$.
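
A short k-NN sketch showing the Minkowski power in practice (my addition):

# k-NN with an explicit Minkowski power: p=1 Manhattan, p=2 Euclidean.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

for p in (1, 2):  # an odd n_neighbors avoids ties in binary problems
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)
    print(f"p={p}:", knn.fit(X_tr, y_tr).score(X_te, y_te))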
Ensemble learning: Bagging (bootstrap aggregation): majority vote. Random Forest: build several decision trees, each using a random selection (with replacement) of features and instances. Boosting: after a classifier $M_i$ is learned, update the weights of the difficult instances for the next classifier $M_{i+1}$. Accuracy = (TP + TN) / all; Precision = TP / (TP + FP); Recall = TP / (TP + FN); $F_\beta = (1 + \beta^2)\,pr / (\beta^2 p + r)$; Jaccard index: IoU, overlap / union.
Fit the scaler on the training data only: train_df_scaled = scaler.fit_transform(train_df); test_df_scaled = scaler.transform(test_df).
§ Evaluation and Model Selection Out-of-sample evaluation. Optimizing hyperparameters: three disjoint sets: training, validation and test. Stratification: similar class distribution across splits. k-fold cross-validation: mutually exclusive, equal-size subsets. Nested k-fold CV: OIO. Hyperparameter tuning: random search. Bias: predictions vs. ground truth; variance: consistency of predictions. Complexity ↑ ⇒ bias ↓, variance ↑. Decision tree pruning: prepruning: node → leaf; postpruning: branches → leaf. CV: cv_results = cross_validate(RandomForestClassifier(random_state=42), X, y, cv=5). Grid search: grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5).
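
Nested k-fold CV gets only the "OIO" shorthand above; a minimal outer/inner sketch in scikit-learn (my addition):

# Nested k-fold CV: the inner loop tunes hyperparameters, the outer loop
# gives an unbiased performance estimate of the whole tuning procedure.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 4, None]}

inner = GridSearchCV(RandomForestClassifier(random_state=42),
                     param_grid=param_grid, cv=3)   # inner: model selection
scores = cross_val_score(inner, X, y, cv=5)         # outer: evaluation
print(scores.mean().round(3))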

§ XAI Interpretability: implicit capacity to explain its reasoning process. Explainability: provide a justification for the predictions. Transparency: algorithmic transparency, decomposability, and simulatability. Intrinsically interpretable models: linear regression, decision tree, k-nearest neighbors; parsimonious (less is more). Post-hoc explanation methods. Model-agnostic post-hoc: measure how changes in the inputs affect the model's outputs. 1 Partial dependence plots: the marginal effect of a feature on the model's prediction when fixing the feature values → average the class probabilities to a desired decision class; the plot allows inspecting whether the relation between the feature and the target variable is monotonic, linear, etc. 2 Permutation feature importance: compute the feature importance as the increase in the model error when permuting the values of the feature being analyzed. Drawback: assumes unrealistic independence. 3 Shapley values (SHAP): computes each feature's contribution; can be used in both local and global contexts. Cons: computationally expensive. 4 Local surrogates (LIME): generates synthetic instances around small groups of instances. Cons: unstable. 5 Global surrogates: approximate the behavior of the complex model with a transparent model. Cons: describes the black-box model rather than the problem. 6 Counterfactual explanations: describe the smallest change to the feature values that produces a different desired output. Model-specific post-hoc: based on the representation structures of the black-box models. 1 Random Forests: compute the importance of each problem feature from their inner knowledge structures. Cons: feature importance based on impurity can be misleading when features have many unique values. 2 Fuzzy Cognitive Maps: recurrent neural networks whose neurons denote variables; feature importance is computed from the absolute values of the weights connected to each neuron in the network. Cons: doesn't consider the activation values of neurons. Evaluation and measures: function level (number of rules of …
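
Permutation feature importance is prose-only in the cheatsheet; a minimal sketch with scikit-learn's inspection module (my addition):

# Permutation importance: error increase when a feature's values are shuffled.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
# Permute each feature on held-out data; the drop in score is its importance.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean.round(3))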
