English summary of the Data Mining for Business and Governance course from the Master in Data Science and Society. Summary of lecture materials, readings, and notes.

  • February 1, 2023
  • 66
  • 2022/2023
Data mining for business and governance
Week 1

What is data mining: the study of collecting, cleaning, processing, analyzing and gaining
useful insights from data.
- It is an umbrella term, and the methods used relate to different disciplines:
o Knowledge discovery in databases
o Statistics
o Artificial intelligence
o Machine learning

Key aspects:
- Computation vs large data sets: trade-off between processing time and memory.
- Computation enables the analysis of large data sets: computers serve as the tool for
handling growing amounts of data.
- Data mining often implies knowledge discovery from databases: from unstructured
data to structured knowledge.

What counts as large amounts of data, or big data:
- Volume
o Too big for manual analysis
o Too big to fit in RAM
o Too big to store on disk
- Variety
o Range of values: variance
o Outliers, confounders and noise
o Different data types
- Velocity
o Data changes quickly: results are required before the data changes
o Streaming data (no storage)
Big data is not only about volume but also about complexity (variety) and, for example, the
speed at which the data changes (velocity).
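
The streaming case (no storage) can be illustrated with a statistic that is updated one value at a time. This is a minimal sketch, not part of the lecture material: the class name `RunningMean` and the readings are made up for illustration.

```python
class RunningMean:
    """Keeps the mean of a stream without storing the individual values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: new_mean = old_mean + (value - old_mean) / n
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


stream = [4.0, 8.0, 6.0, 2.0]  # made-up readings arriving one at a time
running = RunningMean()
for reading in stream:
    running.update(reading)  # each reading can be discarded after this call

print(running.mean)  # mean of the four readings: 5.0
```

Only the count and the current mean are kept in memory, so the same approach works when the data is too large to store.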

Applications of data mining:
- Companies: business intelligence
- Science: knowledge discovery

Data points are the observations in a data set. They can be related (dependent) or
independent.
Dependency-oriented data types: the dependency can be explicit, as in social networks,
where relationships can be captured by edges in a graph representation, or implicit, as in
the temperature readings of a sensor, where nearby readings have similar values but this
relationship is not marked explicitly:
- Spatial data, spatio temporal data
- Network data
- Time series data
- String data (text)
- Discrete sequences (event logs)
The dependency properties matter, for example, for the assumptions of the methods that can
be applied.

In short: Implicit data is information that is not provided intentionally but gathered from
available data streams, either directly or through analysis of explicit data. Explicit data is
information that is provided intentionally, for example through surveys and membership
registration forms.

Non-dependency-oriented data: no dependencies are specified between records:
- Multidimensional data, which can be quantitative, categorical or binary (binary can be
seen as quantitative or categorical)
- Text data with a representation that ignores the relationships between words, for
example looking only at the frequency of words.
Many machine learning models assume that observations are independent of each other.
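
A word-frequency representation of text, which ignores word order, can be sketched as follows. The function name `bag_of_words` and the example sentence are assumptions made for illustration, and the tokenization (lowercasing and splitting on whitespace) is deliberately simplistic.

```python
from collections import Counter


def bag_of_words(text):
    # Lowercase and split on whitespace; word order is deliberately discarded.
    return Counter(text.lower().split())


doc = "data mining turns data into knowledge"  # made-up example sentence
counts = bag_of_words(doc)

print(counts["data"])    # 2
print(counts["mining"])  # 1
```

Two texts with the same words in a different order get the same representation, which is exactly the sense in which the dependency between words is ignored.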

The general pipeline: [figure not recovered]

What makes prediction possible?
- Fitting data is easy, but predictions are hard




- Associations between features/target
o Numerical: correlation coefficient
o Categorical: mutual information (the value of x1 contains information about the
value of x2)
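
For two categorical features, mutual information can be estimated from the observed frequencies. The sketch below uses the standard definition, the sum over value pairs of p(x, y) · log2(p(x, y) / (p(x) · p(y))); the function name `mutual_information` and the toy data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of x
    py = Counter(ys)            # marginal counts of y
    pxy = Counter(zip(xs, ys))  # joint counts of (x, y) pairs
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


# Made-up data: umbrella use is fully determined by the weather,
# the coin flips are unrelated to it.
weather = ["sun", "sun", "rain", "rain"]
umbrella = ["no", "no", "yes", "yes"]
coin = ["h", "t", "h", "t"]

print(mutual_information(weather, umbrella))  # 1.0 bit: x1 tells us x2
print(mutual_information(weather, coin))      # 0.0 bits: no information
```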

Statistical descriptions of data:
Measures of central tendency:
- Mean: average
- Median: the middle value in a set of ordered data values
- Mode: value that occurs most frequently in the set
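
The three measures can be computed with Python's standard `statistics` module. The values below are made up for illustration.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # made-up data

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # average of the two middle values here
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

With an even number of values there is no single middle value, so the median is taken as the average of the two middle ones (3 and 5 here).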




Statistical descriptions are relevant for summarizing the data, for having an overview when
we are exploring the data.

Measuring the spread of data (five-number summary):
- Range: max – min
- Quantiles: points taken at regular intervals of a data distribution, dividing it into
essentially equal-sized consecutive sets. The 2-quantile is the median, the 4-quantiles
are the quartiles (3 cut points: Q1, Q2, Q3), and the 100-quantiles are the percentiles.
- Interquartile range: IQR= Q3 – Q1
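
A five-number summary can be sketched with `statistics.quantiles` from the standard library (Python 3.8+). The data is made up, and the choice of `method="inclusive"` is an assumption: several conventions for computing quartiles exist and they can give slightly different values on small samples.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data

# n=4 returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

value_range = max(data) - min(data)
iqr = q3 - q1

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)  # (1, 3.0, 5.0, 7.0, 9)
print(value_range, iqr)     # 8 4.0
```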

Basic plots: Box plot: includes Q1, the median, Q3, and the min and max values, as well as
outliers: points that lie more than 1.5 × IQR below Q1 or above Q3. It shows how the data
is spread.




When there are outliers, the whiskers (the lines extending outside the box) do not run to
the min and max values; instead they end at the most extreme points that lie within
1.5 × IQR of Q1 and Q3. Box plots allow us to compare the distributions of several features.
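
The 1.5 × IQR rule can be sketched as follows. The function name `box_plot_stats` and the data are assumptions made for illustration, and the quartile convention (`method="inclusive"`) is one of several in use, so plotting libraries may place the fences slightly differently.

```python
import statistics


def box_plot_stats(values):
    """Quartiles, whisker ends and outliers using the 1.5 * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme points that are still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])  # made-up data
print(stats["whisker_high"])  # 8: the upper whisker stops before the outlier
print(stats["outliers"])      # [30]
```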

Measuring the dispersion of data:
- Variance and standard deviation
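
The formulas are not preserved in these notes, so the sketch below uses the standard definitions: the variance is the average squared deviation from the mean, and the standard deviation is its square root. Whether the course uses the population form (divide by n) or the sample form (divide by n − 1) is not stated here; both are shown, on made-up data.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5

# Population variance: sum of squared deviations from the mean, divided by n.
mean = sum(data) / len(data)
pop_var = sum((x - mean) ** 2 for x in data) / len(data)
pop_std = pop_var ** 0.5

# Sample variance divides by n - 1 instead of n.
sample_var = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

print(pop_var, pop_std)  # 4.0 2.0

# The standard library gives the same population values.
print(statistics.pvariance(data), statistics.pstdev(data))
```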



Basic plot: scatter plot: used when you have two features. It gives you an overview of how
the data is distributed over the x, y plane.

Correlation: a measure of how two features move together. It can be quantified with a
correlation coefficient.
- Pearson’s r measures the strength of the linear relationship (dependency)
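
Pearson's r can be computed directly from its standard definition: the covariance of the two features divided by the product of their standard deviations. The function name `pearson_r` and the data are assumptions made for illustration; the second example shows that r only captures linear relationships.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of covariance and variances cancel, so sums suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Perfect linear relationship: y = 2x
r_linear = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Perfect but non-linear relationship: y = x^2 on a symmetric range
r_quadratic = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_linear)     # 1.0
print(r_quadratic)  # 0.0: fully dependent, yet no linear relationship
```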

Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller liekebuuron. Stuvia facilitates payment to the seller.