Data mining for business and governance
Week 1
What is data mining: the study of collecting, cleaning, processing, analyzing and gaining
useful insight of information.
- It is an umbrella term and the methods used relates to different disciplines:
o Knowledge discovery in databases
o Statistics
o Artificial intelligence
o Machine learning
Key aspects:
- Computation vs large data sets: trade-off between processing time and memory.
- Computation enables analysis of the large data sets: computers as a tool and with
growing data.
- Data mining often implies knowledge discovery from databases: from unstructured
data to structured knowledge.
What are large amounts or big data:
- Volume
o Too big for manual analysis
o Too big to fit in RAM
o Too big to store on disk
- Variety
o Range of values: variance
o Outliers, confounders and noise
o Different data types
- Velocity
o Data changes quickly: require results before data changes
o Streaming data (no storage)
It is not only about volume but also about complexity (variety) and for example the speed of
the database.
Applications of data mining:
- Companies: business intelligence
- Science: knowledge discovery
,Datapoints could be observations that you have in a dataset. They could be related
(dependent) or independent.
Dependency oriented data types: explicit, like in social networks, relationships can be
captured by edges in graph representations, or implicit like temperature readings of a
sensor, similar values, this relationship is not market explicitly:
- Spatial data, spatio temporal data
- Network data
- Time series data
- String data (text)
- Discrete sequences (event logs)
The dependency properties relate for example to the assumptions of the methods that can
be applied.
In short: Implicit data is information that is not provided intentionally but gathered from
available data streams, either directly or through analysis of explicit data. Explicit data is
information that is provided intentionally, for example through surveys and membership
registration forms.
Non dependency oriented: no specified dependency records:
- Multidimensional data, can be quantitative, categorical, binary (can be seen as
quantitative or categorical)
- Text data with a representation ignoring the relationship, for example looking into
the frequency of words.
For machine learning models, observations are assumed to be independent of each other.
,The general pipeline:
What makes prediction possible?
- Fitting data is easy, but predictions are hard
- Associations between features/target
o Numerical: correlation coefficient
o Categorical: mutual information value of x1 contains information about value
of x2
Statistical descriptions of data:
Measures of central tendency:
- Mean: average
- Median: the middle vale in a set of ordered data values
- Mode: value that occurs most frequently in the set
Statistical descriptions are relevant for summarizing the data, for having an overview when
we are exploring the data.
, Measuring the spread of data, five number summary:
- Range: (max()) – (min())
- Quantiles: points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets. The second quantile is the median, the 4
quantiles are quartiles (3 data points Q1, Q2, Q3), and 100 quartiles are percentiles.
- Interquartile range: IQR= Q3 – Q1
Basic plots: Box plot: includes Q1, median, Q3, min and max values as well as outliers, points
are at least 1,5 IQR further away from Q1 and Q3. It shows you information about how the
data is spread.
When there are outliers, instead of min max, points that are at 1,5IQR distance from the
median are used instead of the min max values for the whisks (horizontal lines outside the
box). Box plots allow us to compare distributions of several features.
Measuring the dispersion of data:
- Variance standard deviation
Basic plot: scatter plot: when you have two features. It gives you an overview about how the
data is distributed over the x, y plane.
Correlation: measure of how two features are moving together. There is also a coefficient
related to that that you can calculate.
- Pearson’s r measures the strength of linear relationship (dependency)