Data Mining Exam Questions
Link: https://docs.google.com/document/d/1P2Za3RewqiRAVlJkEUFZPb1T3H82_3w9AeuhwVeFvKQ/edit
Other questions can be found here: https://wiki.vtk.be/Data_Mining_(H02C6A)
No real reason to create an account there: mostly the same questions, with no more detail than the Qs here
→ just be sure to add new questions/exams on the wiki for continuity reasons :)
If a question is answered and confirmed to be correct, mark it green.
If a question is answered but not confirmed to be correct, mark it yellow.
If a question is open and has no answer yet, mark it red.
A fixed formula sheet is provided for you during the exam; it can also be found on Toledo (it does not contain all formulas, though).
2022 July
1. Logistic regression weight update
2. PCY exercise
3. Calculate the recommendation for a movie and user with a latent factor model → WE NEED AN EXAMPLE (see the sketch below)
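A minimal sketch of the prediction step, assuming the plain matrix-factorization form r_hat(u, i) = p_u · q_i; the factor matrices here are made up (in the exam they would be given, possibly together with bias terms):

import numpy as np

# Hypothetical latent factors with k = 2 dimensions (made up for illustration).
P = np.array([[1.0, 0.5],   # user u
              [0.2, 1.3]])  # user v
Q = np.array([[0.8, 1.0],   # movie i
              [1.5, 0.1]])  # movie j

# Predicted rating = dot product of the user and movie factor vectors.
R_hat = P @ Q.T
print(R_hat[0, 0])  # r_hat(u, i) = 1.0*0.8 + 0.5*1.0 = 1.3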
4. 5 small questions testing your insights
5. Anomaly detection: you are given a series of graphs, one per day (x-axis: time, y-axis: number of visitors on a website).
a. Is there anything unusual about the data? (For a specific day in the fall, the number of visitors at midnight was doubled.)
b. If there is anything unusual about the data, is this an anomaly or normal but unusual behaviour? (It was an anomaly due to the switch from daylight saving to standard time, if I remember correctly.)
6. 5 small questions testing your insights.
a. One was about active learning
7. BIRCH vs CURE: given a set of points, show how BIRCH (finds only spherical/ellipsoidal clusters) and CURE (can capture more complex shapes thanks to its representative points) would cluster these points (2 clusters)
8. Google created a model in 2008 to predict flu outbreaks by looking at Google searches. The model was fairly accurate up until 2013; afterwards it started overestimating flu cases. Why? I think it might have to do with the rise of social media: many articles about potential flu outbreaks cause people to search more about the flu, which makes the model overestimate. Correlation != causation.
2022 June
1. Logistic regression weight update
2. MaxMiner exercise
3. Bi-level projection exercise (I think?)
4. K means vs GMM (same as 2022 Jan)
5. 5 small questions testing your insights
6. kNN for anomalies (not sure)
7. A table of vaccination rates for different age groups. What are 2 potential problems with this data? Something about Simpson's paradox: the aggregate rate can contradict the trend inside every age group when the group sizes differ (see the made-up illustration below).
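A minimal made-up illustration of Simpson's paradox in this setting (all numbers invented): within each age group the vaccinated have the lower death rate, but because vaccination is concentrated in the high-risk old group, the aggregated comparison flips.

import pandas as pd

# Invented counts, split by age group and vaccination status.
df = pd.DataFrame({
    "age":    ["young", "young", "old", "old"],
    "vax":    ["yes",   "no",    "yes", "no"],
    "deaths": [1,        10,      90,    10],
    "pop":    [1000,     5000,    5000,  500],
})
df["rate"] = df["deaths"] / df["pop"]
print(df)  # within each age group, "yes" has the lower rate

# Aggregated over age, the comparison reverses:
agg = df.groupby("vax")[["deaths", "pop"]].sum()
print(agg["deaths"] / agg["pop"])  # now "yes" looks worse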
2022 Jan
1. Logistic regression → but with gradient descent (does this mean we also have to flip the objective function, i.e. multiply L by −1?) Yes.
ChatGPT: logistic regression can be trained using various optimization algorithms, and gradient descent is one of them. Gradient descent is a common optimization algorithm used to find the optimal parameters for logistic regression, but it is not the only option.
Logistic regression aims to model the probability of a binary outcome based on a set
of input features. The model applies a logistic (sigmoid) function to a linear
combination of the features to map the continuous input space to a probability
between 0 and 1. The parameters of the logistic regression model are estimated to
maximize the likelihood of the observed data.
Gradient descent is an iterative optimization algorithm that adjusts the model
parameters in the direction of the steepest descent of the loss function. In logistic
regression, the loss function is typically the log-likelihood or the negative log-
likelihood. By taking steps proportional to the negative gradient of the loss function,
gradient descent iteratively updates the parameters until convergence to the optimal
solution.
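A minimal sketch of the resulting weight update, assuming binary labels y in {0, 1} and minimization of the negative log-likelihood (the toy data and learning rate are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data; the first column of ones acts as the bias feature.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 0.5, -1.0],
              [1.0, -1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])

w = np.zeros(3)   # initial weights
eta = 0.1         # learning rate

for _ in range(100):
    p = sigmoid(X @ w)     # predicted P(y = 1 | x)
    grad = X.T @ (p - y)   # gradient of the negative log-likelihood
    w = w - eta * grad     # descent step: minus sign because we minimize

print(w)

Note the sign flip the question asks about: maximizing the log-likelihood with gradient ascent is the same as minimizing its negative with gradient descent.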
2. Bi-level projection of sequence DB: 10: <c(ad)a>, 20: <d(ac)da>, 30: <c(cd)a(ac)>
What is this? → look at the last lines of the sequence-mining slides. (If I recall the PrefixSpan paper correctly, bi-level projection first builds a triangular S-matrix with the supports of all length-2 sequences and only materializes projected databases for the prefixes that are worth it.)
3. Max miner algo
4. K-means vs GMM
Both with 2 clusters → where would the centers X1 and X2 be after 1 iteration of clustering from these starting points, given the data (for k-means and for GMM)?
How can you estimate this for the GMM case? Does someone know what the GMM result would look like?
⇒ EM clustering example in the slides; plot it out.
⇒ There should be an intuitive way of doing this, no :((((? HELP → intuition: in k-means each mean jumps to the average of only its assigned points, while in EM each mean moves to a responsibility-weighted average of all points, so it shifts more smoothly (see the sketch below).
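A minimal sketch of one iteration of each, assuming 1-D points, equal mixture weights, and fixed unit variances (all numbers made up):

import numpy as np

x = np.array([0.0, 1.0, 4.0, 5.0])   # made-up 1-D data
mu = np.array([0.5, 2.0])            # made-up starting means X1, X2

# --- one k-means iteration: hard-assign points, then average per cluster ---
assign = np.argmin(np.abs(x[:, None] - mu[None, :]), axis=1)
mu_kmeans = np.array([x[assign == k].mean() for k in range(2)])

# --- one EM iteration for the GMM ---
# E-step: responsibilities r[i, k] proportional to N(x_i | mu_k, 1)
dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
r = dens / dens.sum(axis=1, keepdims=True)
# M-step: each mean becomes the responsibility-weighted average of ALL points.
mu_gmm = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(mu_kmeans, mu_gmm)  # the GMM means move less abruptly than the k-means means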
5. Short questions (we only remember what they were about, not the exact wording):
a. LR and overfitting
b. GBRT with a small learning rate
c. Running a learning algorithm on data with actively acquired labels
d. Drawback of Toivonen's algorithm (as far as I know: it gives no guarantee in one pass; if an itemset in the negative border turns out to be frequent in the full data, you have to rerun with a new sample)
6. Question on DTW (diagram, and how to improve DTW to prevent noise from dominating)
Does someone know the answer to this?
There are slides on Longest Common Subsequence (LCSS) that tackle the noise problem by allowing gaps; they include the algorithm and an example (see the sketch below).
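A minimal sketch of LCSS for real-valued series, assuming two points match when they differ by less than a tolerance eps (the series and eps are made up). Unlike DTW, which must align every point and therefore drags the warping path towards outliers, LCSS can simply skip noisy points:

def lcss(a, b, eps=0.5):
    # L[i][j] = length of the LCSS of a[:i] and b[:j]
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) < eps:
                L[i][j] = L[i - 1][j - 1] + 1            # match: extend
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])  # gap: skip a point
    return L[n][m]

# The noisy spike (9.0) is skipped instead of being force-aligned:
print(lcss([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 9.0, 3.0, 4.0]))  # -> 4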
7. kNN for outliers (rank the points from most to least anomalous)
8. Like the table on slides 54-55 of "Some Data Puzzles"
2021-07-18
1. Exercise on the generate + prune step of Apriori (single iteration)
2. Compute LCSS (Time series)
3. Predict movie ratings using collaborative filtering
4. Exercise on complete link agglomerative clustering
5. GMM: rank the points from most to least anomalous
The data is modelled by a mixture of Gaussians ⇒ each example x has a probability p(x) of being generated by the GMM.
High p(x) → the GMM is likely to generate this sample x → no anomaly
Low p(x) → the GMM is unlikely to generate this sample x → anomaly
How can he ask this? Given alpha and the probabilities of x belonging to each cluster?
Anomaly detection → slide 13. → That slide is kNN for anomalies though; for distances, farthest away is most anomalous. For GMM / probabilities you want to order from low to high p(x) (low chance the model generates the point, hence likely an anomaly). See the sketch below.
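A minimal sketch of the ranking, assuming the mixture weights alpha_k, means, and variances are given in the question (all numbers here are made up):

import numpy as np

# Made-up 1-D GMM: weights alpha_k, means mu_k, standard deviations sigma_k.
alpha = np.array([0.6, 0.4])
mu    = np.array([0.0, 5.0])
sigma = np.array([1.0, 1.0])

def p(x):
    # Mixture density p(x) = sum_k alpha_k * N(x | mu_k, sigma_k^2)
    comp = alpha * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comp.sum()

points = np.array([0.2, 2.5, 5.1, 9.0])
scores = np.array([p(v) for v in points])
print(points[np.argsort(scores)])  # most anomalous (lowest density) first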
6. Convert the data from a training set into the proper format for logistic regression
What are we supposed to do here?
I guess this is related to the fact that logistic regression requires numerical input, so you need to convert categorical variables into indicator variables (dummy coding).
So e.g. when you have data with labels (small, medium, large): since this feature is ordinal you could map it to (0, 1, 2), but for nominal categories with no natural order you should create one indicator column per category instead (see the sketch below).
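A minimal sketch of both encodings with pandas (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({"size":  ["small", "medium", "large", "small"],
                   "color": ["red", "blue", "red", "green"]})

# Ordinal feature: an explicit order makes an integer code defensible.
df["size_num"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# Nominal feature: one indicator (dummy) column per category.
df = pd.get_dummies(df, columns=["color"])
print(df)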