Recap before midterm
What is data mining?
(slides) Data mining is the computational process of discovering patterns
in large data sets involving methods at the intersection of artificial
intelligence, machine learning, statistics and database systems.
(google) Data mining is searching for patterns in data. More precisely,
data mining is the actual extraction of knowledge from data via technologies
that incorporate these principles.
(slides Chris) Data mining is a concept to unify statistics, data analysis
and their related methods in order to understand and analyze actual
phenomena with data.
With data mining, we want to prove that something can be predicted
better than the baseline, or that a certain method works better than a
method that has been explored before.
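As a minimal sketch of "better than the baseline" (the notes do not prescribe a tool; scikit-learn and the synthetic data here are assumptions for illustration), we can compare a real model against a majority-class baseline:

```python
# Sketch: compare a model against a trivial baseline (assumed setup,
# synthetic data; scikit-learn is not prescribed by the notes).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
model = LogisticRegression(max_iter=1000)

print("baseline accuracy:", cross_val_score(baseline, X, y).mean())
print("model accuracy:   ", cross_val_score(model, X, y).mean())
# The claim "this can be predicted" only stands if the model beats the baseline.
```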
What are the related disciplines?
The related disciplines that overlap with data mining are:
1. Artificial Intelligence (AI): interdisciplinary field aiming to develop
intelligent machines
2. Machine Learning (ML): branch of computer science studying
learning from data
3. Statistics: branch of mathematics focused on data
4. Information retrieval/knowledge discovery in databases
What are the applications?
In companies, data mining is applied as business intelligence (market
analysis and management).
In science, data mining is applied as knowledge discovery (scientific
discovery in large data). Text mining (natural language processing) is also
used in science, going from unstructured text to structured knowledge.
What is big data?
(slides) Big data consists of three parts (the three V's):
1. Volume: data that is too big for manual analysis, too big to fit in
RAM and too big to store on disk.
2. Variety: big data has high ranges of values (variance), has outliers,
confounders and noise, and consists of different data types.
3. Velocity: big data changes quickly (results are required before the
data changes) and big data may be streaming data (no storage).
(readings) Datasets that are too large for traditional data-processing
systems and that therefore require new technology. With Big Data 1.0,
businesses got the basic internet technologies in place so that they could
establish a web presence, build electronic commerce capability and
improve operating efficiency. With Big Data 2.0, new systems and
companies started to exploit the interactive nature of the web. The
changes brought on by this shift in thinking are extensive and pervasive;
the most obvious are the incorporation of social-networking components
and the rise of the ‘voice’ of the individual consumer and citizen.
Different types of learning: supervised and unsupervised
Supervised learning (classification, regression) is done using a ground
truth; we have prior knowledge of what the output values of our samples
should be. The goal of supervised learning is to learn a function that,
given a sample of data and desired outputs, best approximates the
relationship between input and output observable in the data. Supervised
learning means that the data is labeled. In supervised learning, you know
x and y.
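A minimal sketch of this idea (toy made-up data; scikit-learn is an assumption, not part of the notes): the learner sees both x and y and fits a function mapping one to the other.

```python
# Sketch of supervised learning: both x and y are known (labeled data).
from sklearn.linear_model import LogisticRegression

# Toy data (made up): x = hours studied, y = passed (1) or failed (0).
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)                   # learn the x -> y mapping from labeled examples

print(model.predict([[3.5]]))     # predict the label of an unseen sample
```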
Unsupervised learning (clustering, dimensionality reduction) does not
have labeled outputs, so its goal is to infer the natural structure present
within a set of data points. Unsupervised learning means that the data is
not labeled, we want to find patterns within the data. In unsupervised
learning, you know only x (you do not know yet what to research). In
short, unsupervised learning can be defined as data mining algorithms
that infer patterns from a dataset without reference to outcomes or
decisions.
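By contrast, an unsupervised sketch (again with assumed synthetic data and scikit-learn) sees only x and infers structure on its own, here in the form of clusters:

```python
# Sketch of unsupervised learning: only x is known, no labels; the
# algorithm infers structure (clusters) from the data alone.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabeled blobs of 2-D points; we never tell the algorithm which is which.
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)             # inferred cluster per point; no y was used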
Semi-supervised classification is a combination of both: we have a small
amount of labeled data together with a large amount of unlabeled data, and
we use the labeled instances to attach the unlabeled ones to the decision
classes.
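One way to sketch this (an assumption for illustration, not the course's prescribed method) is scikit-learn's SelfTrainingClassifier, where unlabeled samples are marked with -1:

```python
# Sketch of semi-supervised classification: a few labeled points plus many
# unlabeled ones (-1 marks "unlabeled", per the scikit-learn convention).
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),
               rng.normal(4, 0.5, (30, 2))])
y = np.full(60, -1)               # start with everything unlabeled
y[:10] = 0                        # label only 10 samples per class
y[30:40] = 1

base = SVC(probability=True)      # base learner must output probabilities
model = SelfTrainingClassifier(base).fit(X, y)
print(model.predict([[0.2, 0.1], [3.9, 4.1]]))
```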
Examples of supervised and unsupervised learning (regression,
classification, clustering, dimensionality reduction)
Supervised: regression, classification (three parts: input, output and function)
Unsupervised: clustering, dimensionality reduction
Workflow of supervised learning
1. Collect data
2. Label examples
3. Choose representation (features are numerical or categorical,
possibly converted to a feature vector)
4. Train models (use a training set for learning and a validation
set for tuning. Hyperparameters are settings of the learning
algorithm: for each hyperparameter value, apply the
algorithm to the training set to learn, check performance on the
validation set and keep the best-performing setting)
5. Evaluate (check the performance of the tuned model on the test set.
You want to estimate how well your model will do in the real
world).
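A sketch of steps 3 to 5 (synthetic numeric data and scikit-learn are assumptions; with categorical features, step 3 would also involve converting them to a feature vector, e.g. by one-hot encoding):

```python
# Sketch of the supervised-learning workflow: train/validation/test split,
# hyperparameter tuning on the validation set, final evaluation on the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Split: 60% train (learning), 20% validation (tuning), 20% test (evaluation).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Step 4: for each hyperparameter value, learn on the training set and
# check performance on the validation set; keep the best setting.
best_k, best_acc = None, 0.0
for k in (1, 3, 5, 9, 15):        # k (number of neighbors) is the hyperparameter
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc

# Step 5: evaluate the tuned model once on the held-out test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best k:", best_k, "| test accuracy:", accuracy_score(y_test, final.predict(X_test)))
```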