Summary: Data Mining for Business and Governance
22 March 2021 · 69 pages · academic year 2020/2021 · author: Robinvanheesch

A detailed summary of the Data Mining course, covering all lecture slides and notes, including examples of the algorithms.
Summary: Data Mining.

Week 1.

Video 1: introduction.

Data mining has (main) relations to:

- Knowledge discovery in databases.
- Artificial intelligence.
- Machine learning.
- Statistics.

Definition of data mining: data mining is the computational process of discovering patterns in large
datasets, involving methods at the intersection of AI, machine learning, statistics, and database systems.

Key aspects of data mining:

- Computation vs. large datasets → trade-off between processing time and memory.
- Computation enables analysis of large datasets → computers as tools for ever-growing data.
- Data mining often implies knowledge discovery from databases → from unstructured data to
structured knowledge.

Big data = dealing with volume, variety, or velocity.

Data mining also can be seen as applied machine learning: it requires skill and as with most skills you
get better with practice and experience.

What makes prediction possible? Associations between features and targets → for numerical features:
the correlation coefficient; for categorical features: mutual information (the value of X1 contains information about the value of X2).
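For categorical features, mutual information can be estimated from co-occurrence counts. A minimal pure-Python sketch (the toy samples are made up for illustration):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired categorical samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# X2 is fully determined by X1, so knowing X1 gives 1 bit about X2:
print(mutual_information(["a", "a", "b", "b"], ["p", "p", "q", "q"]))  # 1.0
# Independent features share no information:
print(mutual_information(["a", "a", "b", "b"], ["p", "q", "p", "q"]))  # 0.0
```

Nonzero mutual information between a feature and the target is exactly what makes the feature useful for prediction.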

There are two main types of learning:

1. Supervised learning (classification/regression).
2. Unsupervised learning (clustering/dimensionality reduction).

Main difference: whether you use labels in your data or not. With unsupervised learning, you have no labels in your
data.

Supervised learning: training data = the portion of the data used for learning. In SL each object is a pair
(input and output). The algorithm searches for a function and applies it to new data points to get the
output (this is done on the test portion, which tells us how good our fit is).

Workflow:

- Collect data.
- Label examples.
- Choose representation.
- Train model(s).
- Evaluate.
1. Collect data: reliability of measurement/privacy and other regulations.
2. Label examples: annotation guidelines.

3. Representation: there are features in your data that can be numerical or categorical. Possibly
convert them to a feature vector → vector = a fixed-size list of numbers → some learning algorithms
require examples represented as vectors.
4. Train: minimize the difference between the target values (the dataset labels) and the
mapped values of your training examples. Keep some examples for final evaluation: the test set.
Use the rest for learning (training set) and tuning (validation set).
Model tuning is about finding the best values for the model's hyperparameters. For
each value of the hyperparameters:
• Apply algorithm to training set to learn.
• Check performance on validation set.
• Find/choose best performing setting.
5. Evaluate: check performance of the model on test set and this is about generalization of the
model: goal is to estimate how well your model will do in the real world. Keep evaluation
realistic.
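The train/validate/test workflow above can be sketched in a few lines of pure Python. Everything here is hypothetical: the "model" is just a threshold on a single feature, and the threshold plays the role of a hyperparameter:

```python
import random

# Hypothetical 1-D data: the label is 1 exactly when the feature exceeds 0.6.
random.seed(0)
data = [(x, int(x > 0.6)) for x in (random.random() for _ in range(300))]
train, val, test = data[:200], data[200:250], data[250:]

def accuracy(threshold, examples):
    """Fraction of examples whose label matches the threshold rule."""
    return sum((x > threshold) == bool(y) for x, y in examples) / len(examples)

# For each hyperparameter value: "learn" on the training set (trivial here),
# check performance on the validation set, keep the best setting.
best = max([0.3, 0.5, 0.6, 0.7], key=lambda t: accuracy(t, val))

# Only now touch the test set, to estimate real-world performance.
print("best threshold:", best, "test accuracy:", accuracy(best, test))
```

The key discipline is that the test set plays no role in choosing the hyperparameter; it is used once, at the end, for the generalization estimate.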

The correlation coefficient is a measure of the linear relationship between features. When the
correlation is 1, the line rises from left to right; when the correlation is -1, the line falls from left to right.

The numerator of the correlation coefficient = the covariance; the denominator = the product of the standard deviations.
Covariance = to what extent the features change together. The product of standard deviations makes the correlation
independent of units.

Covariance is a measure of the joint variability of two variables (x, y). The sign of the covariance shows
the direction of the linear relationship between the two features, but its magnitude
is not easy to interpret. The correlation coefficient is normalized and corresponds to the strength of
the linear relation.

Pearson’s r only measures linear tendency. If r = 0, that does not mean the two features
are unrelated; it just means the relation is not linear. Correlation does not imply causation, but it may still enable
prediction.
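The definition above (covariance divided by the product of standard deviations) can be computed directly. A pure-Python sketch with toy data; the last case is y = x², a related-but-not-linear pair:

```python
import math

def pearson_r(xs, ys):
    """Correlation = covariance / (product of standard deviations)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0: line going up
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0: line going down
print(round(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]), 6))  # 0.0: y = x^2
```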

Correlation VS causation. Possible causal relationships between two events A and B measured by
correlated random variables.

- A causes B.
- B causes A.
- C Causes both A and B.
- The correlation is a coincidence.
- Combination of the above.

Discovery of correlation can suggest a causal relationship but that is not necessarily the case.
Sometimes the causal relationship can only be discovered by an experimental study → looking into
variables, keeping everything else constant, does it change → this is hard.

Linear regression is one type of regression: given two variables, we want to come up with
a relationship that predicts one variable based on the other.

Regression analysis describes a relationship between random variables: the independent variable (input)
and the dependent variable (output). In a general regression model the relationship need not be a linear
function. We focus on linear: f(x) = Ax + B.
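A least-squares fit of f(x) = Ax + B follows directly from the covariance formulas above. A sketch with toy points that lie exactly on y = 2x + 1:

```python
def fit_line(xs, ys):
    """Least-squares fit of f(x) = a*x + b (simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx  # the line passes through the mean point (mx, my)
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 2.0 1.0
```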

Regression is an example of supervised learning; another example is classification. Here we look at which
class each data point belongss to: given a new data point, we determine which class it
falls into and assign that class to it. This could be a positive and a negative class (1,
0). So, classification is about finding classes and assigning data points to them (naïve Bayes, k-NN).

Decision boundaries are the boundaries that separate the different classes in a classification task. These
boundaries are not necessarily straight lines. The boundary represents the model's separation
of the two classes.
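As a minimal sketch of a classifier whose decision boundary need not be a straight line, here is k-nearest-neighbours (mentioned above) in pure Python; the 2-D points are hypothetical:

```python
import math

def knn_predict(train, query, k=3):
    """Classify by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Hypothetical 2-D points: class 0 near the origin, class 1 around (5, 5).
train = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0),
         ((5, 5), 1), ((6, 5), 1), ((5, 6), 1)]
print(knn_predict(train, (0.5, 0.5)))  # 0
print(knn_predict(train, (5.5, 5.5)))  # 1
```

The implicit decision boundary here is wherever the majority of the k nearest neighbours flips from one class to the other, which in general is a curved, piecewise shape rather than a single line.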

An example of unsupervised learning could be dimensionality reduction, or clustering.

Dimensionality reduction = the process of reducing the number of random variables under consideration
by obtaining a set of principal variables. It divides into:

- Feature selection = variable selection: the process of selecting relevant features for use in model
construction.
- Feature extraction = deriving new, relevant features from the existing ones.

Clustering = when we have a set of data points but no labels, and we are supposed to group similar
points together without having labels.
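Clustering can be sketched with plain k-means in pure Python. The toy points and the naive initialization (first k points as centroids) are assumptions for illustration:

```python
import math

def kmeans(points, k, iters=10):
    """Plain k-means: assign points to the nearest centroid, recompute means."""
    centroids = points[:k]  # naive initialization: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = per-coordinate mean of its cluster (keep old if empty).
        centroids = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    return centroids

# Two obvious groups of 2-D points; note that no labels are involved:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
```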

Video 1.

What is Data Science? Data science is a concept that unifies statistics, data analysis and their related
methods in order to understand and analyze actual phenomena with data.

What makes a Data Scientist? Data scientists use their data and analytics ability to find and interpret
rich data sources, manage large amounts of data (…), create visualizations to aid in understanding data,
build mathematical models using the data, and present and communicate the data insights/findings.

There are many related fields:




They all have one commonality: data-driven science.

What is data? Often you need to make data numerical, e.g. binary (1 = yes, 0 = no). Different units make it hard
to read the data.

Interpreting data.

For the child weather example: can we think of rules for when it is play time?

We want to know whether the kid wanted to play, and this is what we
call a target. We use certain pieces of information (features)
to predict this target.




Formally:

- We have our data: X (with features: outlook, temperature, windy).
- Our data consists of smaller instances; 'some instance' is written as x.
- If we want to point specifically at a particular instance (say our first row), we write x1. We can
see our model as a function f that, given any instance x, gives us a prediction ^y.
- The application of the model to some instance in our data can be written as f(x).
- Our hope is that ^y is the same as our target y: y is a given (the truth) and ^y is a prediction.
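The notation above can be made concrete with a toy, hand-made rule model f; the instances and the rule are hypothetical:

```python
# Hypothetical instances x with features (outlook, temperature, windy),
# and their true targets y (1 = the kid plays, 0 = not):
X = [("sunny", 25, False), ("rainy", 15, True), ("sunny", 30, True)]
y = [1, 0, 1]

def f(x):
    """A hand-made rule model: predict play exactly when the outlook is sunny."""
    outlook, temperature, windy = x  # only outlook is used by this rule
    return 1 if outlook == "sunny" else 0

y_hat = [f(x) for x in X]  # y_hat stands for the predictions ^y
print(y_hat)  # [1, 0, 1] -> matches the target y on all three instances
```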

In realistic cases it is very important that we evaluate the models (algorithms) we make:
we need to know how they perform, and whether they perform well on new data.
How do we know if our model performs well?

- Correct evaluation is incredibly important in data mining.
- We came up with some rules, but how do we know they generalize, i.e. whether the rules we learned
apply with the same success rate to data where we don't know what the target is?

For the child play example: we got 5/6 correct, which means the model has 83.3% accuracy. But
did we cover all the predictions? What if we are presented with new conditions? Our rules are probably
too strict. Besides the training data (for which we know the labels and from which we determined our rules), we
also need test data, unseen by us, to evaluate. We can use this test data to evaluate how our model
would perform on new data.
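The 5/6 = 83.3% figure is just the accuracy formula. A sketch with hypothetical labels for the six days:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the target labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and rule-based predictions for the six days:
y_true = ["yes", "yes", "no", "yes", "no", "no"]
y_pred = ["yes", "yes", "no", "yes", "no", "yes"]  # one mistake
print(round(accuracy(y_true, y_pred), 3))  # 0.833
```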

Case: prediction of house prices:

- Would you be able to determine the price of a house? → you need expert knowledge, which
requires many observations to gain experience.
- Can you come up with a few features to predict the price of a house?
• Number of bedrooms;
• Big garden (yes/no);
• Good neighborhood.

How do we evaluate the house price? In the previous example, we had a clear binary prediction: either
yes or no. Say we need more classes; we would still be predicting a nominal target (order does not
matter). But what about a numeric target like a house price? We can no longer count how many predictions were
exactly correct, and therefore we can't use accuracy. Instead we are interested in how far our
prediction was off from the actual value: this is called the error.
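For numeric targets, the error is typically summarized as mean absolute error or mean squared error. A sketch with made-up house prices:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between prediction and truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Like MAE, but squaring punishes large misses more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up house prices (in thousands) and model predictions:
actual = [250, 300, 180]
predicted = [240, 330, 200]
print(mean_absolute_error(actual, predicted))           # 20.0
print(round(mean_squared_error(actual, predicted), 2))  # 466.67
```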
