Worked solutions to data mining exam questions

Course
Data Mining

Institution
Katholieke Universiteit Leuven (KU Leuven)

Worked solutions to older exam questions from students, with feedback.

Published on 21 December 2023
Number of pages: 16
Written in 2022/2023
Type: Exam
Contains: Questions and answers

Seller: sepm13

Date Download: 19/06/2023

Data Mining Exam Questions
Link: https://docs.google.com/document/d/1P2Za3RewqiRAVlJkEUFZPb1T3H82_3w9AeuhwVeFvKQ/edit?fbclid=IwAR1lmzkov2kXUnQ-HWm6LWXP_qW7kWKnwpWPOgeXxbevJvL9Xo0QAFhmqJA

Other questions can be found here: https://wiki.vtk.be/Data_Mining_(H02C6A)
No real reason to create an account - mostly the same and no more detail than the Qs here
→ just be sure to add new questions/exams on the wiki for continuity reasons :)

If a question is answered and confirmed to be correct, mark it green.

If a question is answered but not confirmed to be correct, mark it yellow.

If a question is open and has no answer yet, mark it red.

There is a fixed formula sheet that is provided for you during the exam and it can be found
on Toledo as well (it does not contain all formulas though)

2022 July
1. Logistic regression weight update
2. PCY exercise
3. Calculate the recommendation for a given movie and user with a latent factor model -> WE NEED AN EXAMPLE
4. 5 small questions testing your insights
5. Anomaly detection: you are given a series of graphs, one per day (x-axis: time, y-axis: number of visitors on a website).
a. Is there anything unusual about the data? (For one specific day in the fall, the number of visitors doubled at midnight.)
b. If there is anything unusual about the data, is this an anomaly or normal but unusual behaviour? (It was an anomaly due to the switch from daylight saving time to standard time, if I remember correctly.)
6. 5 small questions testing your insights.
a. One was about active learning
7. BIRCH vs CURE: given a set of points, show how BIRCH (only ellipsoids) and CURE (can take more complex shapes) would cluster these points (2 clusters)
8. Google created a model in 2008 to predict flu outbreaks by looking at Google searches. The model was fairly accurate up until 2013; afterwards it started overestimating flu cases. Why? I think it might have to do with the rise of social media: many articles about potential flu outbreaks cause people to search more about the flu, which makes the model overestimate. Correlation != causation
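
For question 3 (latent factor model), a minimal sketch of how a predicted rating and a recommendation could be computed. This is not the exam's data: the factor values, the user and movie names, and the helper names `predict`, `recommend` and `sgd_step` are all made up for illustration, and the exam may use a variant with bias terms.

```python
# Latent factor model: predicted rating = dot product of the user's factor
# vector and the movie's factor vector. All numbers below are hypothetical.

P = {"alice": [1.0, 0.5], "bob": [0.2, 1.5]}        # user factors
Q = {"matrix": [4.0, 0.0], "titanic": [1.0, 2.0]}   # movie factors


def predict(user, movie):
    """Predicted rating of `movie` by `user`."""
    return sum(p * q for p, q in zip(P[user], Q[movie]))


def recommend(user, seen):
    """Recommend the unseen movie with the highest predicted rating."""
    unseen = [m for m in Q if m not in seen]
    return max(unseen, key=lambda m: predict(user, m))


def sgd_step(user, movie, rating, lr=0.1, reg=0.0):
    """One stochastic gradient descent update on a single known rating."""
    err = rating - predict(user, movie)
    p_old, q_old = P[user][:], Q[movie][:]
    P[user] = [p + lr * (err * q - reg * p) for p, q in zip(p_old, q_old)]
    Q[movie] = [q + lr * (err * p - reg * q) for p, q in zip(p_old, q_old)]
    return err


print(predict("alice", "matrix"))      # 1.0*4.0 + 0.5*0.0 = 4.0
print(recommend("bob", seen=set()))    # titanic (3.2 vs 0.8)
```

On the exam the factor matrices are presumably given, so only the dot product and the comparison between movies are needed; `sgd_step` only illustrates how the factors would be learned from known ratings.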

2022 June
1. Logistic regression weight update
2. Max-Miner exercise
3. Bi-level projection exercise (I think?)
4. K-means vs GMM (same as 2022 Jan)
5. 5 small questions testing your insights
6. kNN for anomalies (not sure)
7. A table of vaccination rates for different age groups. What are 2 potential problems with this data? Something about Simpson's paradox
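
For question 7, a small sketch of what Simpson's paradox looks like in numbers. The exam table is not available here, so every count below is invented purely for illustration: within each age group the vaccinated have the lower case rate, yet the pooled rates point the other way because most vaccinated people are in the high-risk age group.

```python
# (cases, group size) per age group and vaccination status.
# All counts are hypothetical, chosen only to show the effect.
data = {
    "young": {"vaccinated": (1, 100), "unvaccinated": (18, 900)},
    "old": {"vaccinated": (90, 900), "unvaccinated": (20, 100)},
}


def rate(cases, total):
    return cases / total


def overall_rate(status):
    """Case rate after pooling all age groups together."""
    cases = sum(group[status][0] for group in data.values())
    total = sum(group[status][1] for group in data.values())
    return cases / total


for age, group in data.items():
    print(age, rate(*group["vaccinated"]), rate(*group["unvaccinated"]))
print("overall", overall_rate("vaccinated"), overall_rate("unvaccinated"))
```

Per group the vaccinated rate is lower (1% vs 2%, 10% vs 20%), but pooled it is higher (9.1% vs 3.8%), so conclusions drawn from the aggregated table alone would be misleading.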

2022 Jan
1. Logistic regression, but with gradient descent. Does this mean we also have to flip the objective function (multiply L by -1)? Yes: gradient descent minimizes, so you minimize -L instead of maximizing L.
ChatGPT: logistic regression can be trained using various optimization algorithms, and gradient descent is one of them. Gradient descent is a common optimization algorithm used to find the optimal parameters for logistic regression, but it is not the only option.

Logistic regression aims to model the probability of a binary outcome based on a set
of input features. The model applies a logistic (sigmoid) function to a linear
combination of the features to map the continuous input space to a probability
between 0 and 1. The parameters of the logistic regression model are estimated to
maximize the likelihood of the observed data.

Gradient descent is an iterative optimization algorithm that adjusts the model parameters in the direction of the steepest descent of the loss function. In logistic regression, the loss function is typically the negative log-likelihood (minimizing it is the same as maximizing the log-likelihood). By taking steps proportional to the negative gradient of the loss function, gradient descent iteratively updates the parameters until convergence to the optimal solution.
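
The weight update described above can be sketched as follows. This is a minimal illustration, assuming batch gradient descent on the negative log-likelihood with labels in {0, 1}; the toy data, the learning rate and the function names are made up, and the formula sheet may use a different convention (for example a per-example update or gradient ascent on L).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(w, X, y, lr):
    """One batch gradient descent step on the negative log-likelihood.

    w: weights, with w[0] the bias; X: list of feature lists; y: labels in {0, 1}.
    """
    grad = [0.0] * len(w)
    for features, label in zip(X, y):
        x = [1.0] + list(features)            # constant 1 for the bias term
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += (p - label) * xj       # d(-L)/dw_j for this example
    return [wj - lr * gj for wj, gj in zip(w, grad)]


def neg_log_likelihood(w, X, y):
    total = 0.0
    for features, label in zip(X, y):
        p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], features)))
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total


# Hypothetical toy data: one feature, four examples.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w0 = [0.0, 0.0]
w1 = gradient_step(w0, X, y, lr=0.1)
print(w1, neg_log_likelihood(w0, X, y), neg_log_likelihood(w1, X, y))
```

Subtracting `lr * grad` of -L gives the same update as adding `lr * sum((label - p) * x)` in gradient ascent on L, which is why flipping the sign of the objective is enough.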

2. Bi-level projection of the sequence DB: 10:<c(ad)a>, 20:<d(ac)da>, 30:<c(cd)a(ac)>
What is this? → look at the last lines of sequence mining
3. Max-Miner algorithm

4. K-means vs GMM

Both use 2 clusters: given the data, where would X1 and X2 be after 1 iteration of clustering from these starting points (for K-means and for GMM)?

How can you estimate this for the GMM case?

Does anyone know what the GMM would look like?
=> See the EM clustering example in the slides and plot it out

=> There should be an intuitive way of doing this, no? HELP
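
One way to build intuition for the K-means vs GMM question is to compare a single iteration of each on the same data. The sketch below is 1-D and simplified: the points and starting centers are invented, and the GMM step updates only the means while keeping equal mixing weights and a fixed, shared standard deviation (a full EM step would also update those).

```python
import math


def kmeans_step(points, centers):
    """Hard-assign each point to its nearest center, then recompute the means."""
    clusters = [[] for _ in centers]
    for x in points:
        k = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        clusters[k].append(x)
    # An empty cluster keeps its old center.
    return [sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)]


def gmm_step(points, means, sigma=1.0):
    """One EM iteration for the means only (equal weights, shared fixed sigma)."""
    num = [0.0] * len(means)
    den = [0.0] * len(means)
    for x in points:
        # E-step: unnormalized Gaussian densities (shared constants cancel).
        dens = [math.exp(-0.5 * ((x - m) / sigma) ** 2) for m in means]
        total = sum(dens)
        for k, d in enumerate(dens):
            r = d / total                     # responsibility of cluster k for x
            num[k] += r * x
            den[k] += r
    # M-step: responsibility-weighted means.
    return [n / d for n, d in zip(num, den)]


points = [1.0, 2.0, 3.0, 8.0, 9.0]            # hypothetical data
start = [0.0, 5.0]                            # hypothetical starting centers
print(kmeans_step(points, start))
print(gmm_step(points, start))
```

K-means moves each center to the plain average of its own points, while the GMM means are pulled slightly by every point in proportion to its responsibility, so points near the boundary (here x = 2 and x = 3) influence both centers.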

5. Short questions (I only remember the answers, not the questions)
a. LR and overfitting
b. GBRT with a small LR
c. Run the learning algorithm on data with actively acquired labels
d. Drawback of Toivonen's algorithm

6. Question on DTW (diagram, and how to improve DTW to make it robust to noise)
Does anyone know the answer to this?
There are slides on Longest Common Subsequence (LCSS) that tackle the noise problem by allowing for gaps. They include the algorithm and an example.
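
As an illustration of how LCSS allows gaps, here is a minimal dynamic-programming sketch. It assumes the common formulation in which two values match when they differ by at most `eps`, optionally restricted to indices at most `delta` apart; the parameter names and the example series are assumptions, and the version in the slides may differ in details.

```python
def lcss(a, b, eps, delta=None):
    """Length of the longest common subsequence of two numeric series.

    Two values match if |a_i - b_j| <= eps and, when delta is given,
    their positions differ by at most delta.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close = abs(a[i - 1] - b[j - 1]) <= eps
            in_window = delta is None or abs(i - j) <= delta
            if close and in_window:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


a = [1, 2, 3, 4, 5]
b = [1, 2, 100, 3, 4, 5]          # same series with one noise spike
print(lcss(a, b, eps=0.5))        # 5: the spike is simply skipped
```

Unlike DTW, which has to align every point (so the spike adds a large cost), LCSS leaves unmatched points out of the subsequence, which is what makes it robust to noise.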

7. kNN for outliers (rank the points from most to least anomalous)
8. A table like the one on the slides (Some Data Puzzles, p. 54-55)
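
For question 7 (kNN for outliers), a minimal sketch of one common scoring rule: the anomaly score of a point is the distance to its k-th nearest neighbour, and points are ranked from the highest score down. Whether the course uses the k-th distance or the average over the k nearest neighbours is an assumption here, and the points are invented.

```python
import math


def knn_scores(points, k):
    """Anomaly score = distance to the k-th nearest other point."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def rank_anomalies(points, k):
    """Indices of the points, most anomalous first."""
    scores = knn_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]   # hypothetical data
print(knn_scores(points, k=2))
print(rank_anomalies(points, k=2))                  # index 4, i.e. (5, 5), comes first
```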

2021-07-18

1. Exercise on the generate + prune step of apriori (single iteration)
2. Compute LCSS (Time series)
3. Predict movie ratings using collaborative filtering
4. Exercise on complete link agglomerative clustering
5. GMM: rank the points from most to least anomalous
Data is represented by a mixture of Gaussians ⇒ each example x has a probability density p(x) of being generated by the GMM
High p(x) → the GMM is likely to generate this sample x → not an anomaly
Low p(x) → the GMM is unlikely to generate this sample x → anomaly
How could this be asked? Given alpha and the probabilities of x belonging to each cluster?
Anomaly detection -> slide 13 → but this is kNN for anomalies: there, the point with the largest distances is the most anomalous. For a GMM you sort by p(x) from low to high (the lowest p(x) is the least likely to be generated, hence the most anomalous).
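
The ranking by p(x) can be sketched as follows, assuming a 1-D mixture whose parameters (alpha, mean, standard deviation per component) are given. The component values and the points are invented for illustration.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmm_density(x, components):
    """p(x) under the mixture; components is a list of (alpha, mean, std)."""
    return sum(alpha * normal_pdf(x, mean, std) for alpha, mean, std in components)


def rank_by_anomaly(points, components):
    """Points sorted from most to least anomalous (lowest p(x) first)."""
    return sorted(points, key=lambda x: gmm_density(x, components))


components = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]   # hypothetical mixture
points = [0.2, 5.0, 9.0, 13.0]
print(rank_by_anomaly(points, components))          # [5.0, 13.0, 9.0, 0.2]
```

Note that 5.0 lies exactly between the two clusters and is the most anomalous point here, even though 13.0 is further from the bulk of the data: what counts is the density p(x), not the distance to the overall mean.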
6. Convert the data from a training set into the proper format for logistic regression
What are we supposed to do here?
I guess this is related to the fact that logistic regression requires numerical input data, so you need to convert categorical variables into indicator variables (dummy coding).
So e.g. when you have data with labels (small, medium, large), can you convert it to (0,1,2)? That would be integer (ordinal) coding; dummy coding instead gives one 0/1 indicator column per category, e.g. small → (1,0,0), medium → (0,1,0), large → (0,0,1).