Summary of the course BDMA. Grade achieved: 8.8

Preview: 4 of 84 pages • 7 December 2020 • 2019/2020

2  beoordelingen

review-writer-avatar

Door: ravdeepksingh • 1 jaar geleden

review-writer-avatar

Door: felienkarsten • 3 jaar geleden

avatar-seller
Summary Big Data Management and Analytics
Book: Data Science
Chapter 1
Data science involves principles, processes, and techniques for understanding phenomena via the
(automated) analysis of data. The ultimate goal is improving decision making.

Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data,
rather than purely on intuition. There are two sorts of decisions:

(1) Decisions for which “discoveries” need to be made within data
(2) Decisions that repeat, especially at massive scale, and so decision-making can benefit from
even small increases in decision-making accuracy based on data analysis.
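
The second kind of decision can be made concrete with a little arithmetic. The sketch below is only an illustration: the number of decisions, the payoff per correct decision, and both accuracy levels are made-up assumptions, not figures from the book.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
decisions_per_year = 10_000_000   # repeated decisions per year (assumed)
value_per_correct = 0.50          # payoff of one correct decision, in euros (assumed)


def expected_value(accuracy: float) -> float:
    """Expected yearly payoff if a fraction `accuracy` of decisions is correct."""
    return decisions_per_year * accuracy * value_per_correct


baseline = expected_value(0.70)   # assumed accuracy of intuition-based decisions
improved = expected_value(0.71)   # one percentage point better thanks to data analysis
gain = improved - baseline

print(f"Yearly gain from +1 point of accuracy: {gain:,.0f} euros")
```

Under these assumptions a single percentage point of extra accuracy is worth tens of thousands of euros per year, simply because the decision repeats so often.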




There is a lot to data processing that is not data science—despite the impression one might get from
the media. Data engineering and processing are critical to support data science, but they are more
general.

• Data science needs access to data and it often benefits from sophisticated data engineering
that data processing technologies may facilitate, but these technologies are not data science
technologies per se.
• Data processing technologies are very important for many data-oriented business tasks that
do not involve extracting knowledge or data-driven decision-making, such as efficient
transaction processing, modern web system processing, and online advertising campaign
management.

Big data essentially means datasets that are too large for traditional data processing systems, and
therefore require new processing technologies. Big data technologies are used for:

• Data engineering
• Data mining
• But, most often: data processing in support of data mining techniques and other data science
activities


A fundamental strategy of data science is to acquire the necessary data at a cost. Once we view data
as a business asset, we should think about whether and how much we are willing to invest.

Four fundamental concepts of data science:

1. Extracting useful knowledge from data to solve business problems can be treated
systematically by following a process with reasonably well-defined stages.
2. From a large mass of data, information technology can be used to find informative
descriptive attributes of entities of interest.
3. If you look too hard at a set of data, you will find something—but it might not generalize
beyond the data you’re looking at.
4. Formulating data mining solutions and evaluating the results involves thinking carefully
about the context in which they will be used.

Chapter 2
Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised
versus unsupervised data mining.

An important principle of data science is that data mining is a process with fairly well-understood
stages.

Examples of data mining algorithm tasks:

1. Classification and class probability estimation attempt to predict, for each individual in a
population, which of a (small) set of classes this individual belongs to. (E.g. “Among all the
customers of MegaTelCo, which are likely to respond to a given offer?”) In this example the
two classes could be called will respond and will not respond.
a. A closely related task is scoring or class probability estimation. A scoring model
applied to an individual produces, instead of a class prediction, a score representing
the probability that that individual belongs to each class.
2. Regression: (“value estimation”) attempts to estimate or predict, for each individual, the
numerical value of some variable for that individual. An example regression question would
be: “How much will a given customer use the service?”
a. Regression is related to classification, but the two are different. Informally,
classification predicts whether something will happen, whereas regression predicts
how much something will happen.
3. Similarity matching: attempts to identify similar individuals based on data known about
them. Similarity matching can be used directly to find similar entities. For example, IBM is
interested in finding companies similar to their best business customers, in order to focus
their sales force on the best opportunities.
4. Clustering: attempts to group individuals in a population together by their similarity, but not
driven by any specific purpose. An example clustering question would be: “Do our customers
form natural groups or segments?”
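
The first three tasks can be sketched on one toy dataset. The example below uses a simple nearest-neighbour approach, which is a choice made here for illustration and not a method prescribed by the summary. The customer records, the function names, and the default threshold are all hypothetical.

```python
from math import dist

# Hypothetical customers: (age, monthly bill), responded to offer?, minutes used
customers = [
    ((25, 30.0), True, 420),
    ((31, 35.0), True, 390),
    ((47, 80.0), False, 120),
    ((52, 75.0), False, 150),
    ((29, 40.0), True, 400),
    ((60, 90.0), False, 100),
]


def nearest(query, k=3):
    """Similarity matching: the k customers closest to `query` (Euclidean distance)."""
    return sorted(customers, key=lambda c: dist(c[0], query))[:k]


def response_score(query, k=3):
    """Class probability estimation: share of the k neighbours who responded."""
    neighbours = nearest(query, k)
    return sum(1 for _, responded, _ in neighbours if responded) / k


def will_respond(query, k=3, threshold=0.5):
    """Classification: turn the score into one of two classes."""
    return response_score(query, k) >= threshold


def expected_usage(query, k=3):
    """Regression: estimate a numeric value as the neighbours' average usage."""
    neighbours = nearest(query, k)
    return sum(minutes for _, _, minutes in neighbours) / k


new_customer = (28, 33.0)
print(will_respond(new_customer), response_score(new_customer), expected_usage(new_customer))
```

The same neighbours answer three different questions: who is similar (similarity matching), whether and how likely the customer is to respond (classification and scoring), and how much the customer will use the service (regression).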


Supervised versus unsupervised methods: A vital part in the early stages of the data mining process
is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to
produce a precise definition of a target variable.

• Consider two similar questions we might ask about a customer population. The first is: “Do
our customers naturally fall into different groups?” Here no specific purpose or target has
been specified for the grouping. When there is no such target, the data mining problem is
referred to as unsupervised.
  o Example: Clustering
• Contrast this with a slightly different question: “Can we find groups of customers who have
particularly high likelihoods of canceling their service soon after their contracts expire?” Here
there is a specific target defined: will a customer leave when her contract expires? In this
case, segmentation is being done for a specific reason. This is called a supervised data mining
problem.
  o Examples: Classification & Regression.
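
The contrast can be illustrated with a minimal sketch on the same data. The bills, the churn labels, and both helper functions are hypothetical. The two-centre grouping and the single-threshold split are simple stand-ins chosen here, not techniques named in the summary.

```python
# Hypothetical data: monthly bill per customer and whether they later churned.
bills = [20.0, 25.0, 30.0, 70.0, 80.0, 90.0]
churned = [False, False, True, True, True, False]


def two_means(values, iterations=10):
    """Unsupervised: group values around two centres; no target is used."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Index 0 if v is closer to the low centre, 1 if closer to the high one.
            groups[abs(v - lo) > abs(v - hi)].append(v)
        lo, hi = (sum(g) / len(g) for g in groups)
    return groups


def best_threshold(values, target):
    """Supervised: pick the split 'value >= t' that best predicts the target."""
    def accuracy(t):
        return sum((v >= t) == y for v, y in zip(values, target)) / len(values)
    return max(sorted(set(values)), key=accuracy)


print(two_means(bills))                 # natural groups, found without a target
print(best_threshold(bills, churned))   # split chosen because it predicts churn
```

On this toy data the natural groups separate low bills from high bills, while the supervised split lands elsewhere because it is judged only by how well it predicts the target.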

Cross Industry Standard Process for Data Mining (CRISP-DM)




This process diagram makes explicit the fact that iteration is the rule rather than the exception.
Going through the process once without having solved the problem is, generally speaking, not a
failure.

Business Understanding

Initially, it is vital to understand the problem to be solved. This may seem obvious, but business
projects seldom come pre-packaged as clear and unambiguous data mining problems. Often
recasting the problem and designing a solution is an iterative process of discovery. The process
model represents this as cycles within a cycle, rather than as a simple linear process. The initial
formulation may not be complete or optimal so multiple iterations may be necessary for an
acceptable solution formulation to appear. In this first stage, the design team should think carefully
about the use scenario – What exactly do we want to do?

Data Understanding

If solving the business problem is the goal, the data comprise the available raw material from which
the solution will be built. It is important to understand the strengths and limitations of the data
because rarely is there an exact match with the problem. A critical part of the data understanding
phase is estimating the costs and benefits of each data source and deciding whether further
investment is merited. In data understanding we need to dig beneath the surface to uncover the
structure of the business problem and the data that are available, and then match them to one or
more data mining tasks.

Data Preparation

A data preparation phase often proceeds along with data understanding, in which the data are
manipulated and converted into forms that yield better results. Typical examples of data preparation
are converting data to tabular format, removing or inferring missing values, and converting data to
different types.
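
The three typical steps can be sketched as follows. The raw records, the field names, and the choice to fill missing ages with the mean are assumptions made for illustration; other ways of handling missing values (for example removing the record) are equally valid.

```python
from statistics import mean

# Hypothetical raw records as they might arrive from a source system.
raw = [
    {"id": "1", "age": "34", "plan": "basic"},
    {"id": "2", "age": "", "plan": "premium"},
    {"id": "3", "age": "46", "plan": "basic"},
]


def prepare(records):
    """Return a table (list of rows) with typed, complete values."""
    known_ages = [int(r["age"]) for r in records if r["age"]]
    fill = mean(known_ages)  # infer missing ages with the mean (assumed strategy)
    table = []
    for r in records:
        age = int(r["age"]) if r["age"] else fill
        is_premium = 1 if r["plan"] == "premium" else 0  # category -> numeric flag
        table.append([int(r["id"]), float(age), is_premium])
    return table


table = prepare(raw)
for row in table:
    print(row)
```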

Modeling

The output of modeling is some sort of model or pattern capturing regularities in the data. The
modeling stage is the primary place where data mining techniques are applied to the data.

Evaluation

The purpose of the evaluation stage is to assess the data mining results rigorously and to gain
confidence that they are valid and reliable before moving on. Equally important, the evaluation stage
also serves to help ensure that the model satisfies the original business goals. Recall that the primary
goal of data science for business is to support decision making.

A model may be extremely accurate (> 99%) by laboratory standards, but evaluation in the actual
business context may reveal that it still produces too many false alarms to be economically feasible.
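
A worked example shows how this can happen when the event of interest is rare. All counts and costs below are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical confusion-matrix counts for 1,000,000 transactions,
# of which 1,000 are truly fraudulent (assumed numbers).
true_positives = 990        # frauds flagged
false_negatives = 10        # frauds missed
false_positives = 4_995     # legitimate transactions flagged (false alarms)
true_negatives = 994_005    # legitimate transactions passed

total = true_positives + false_negatives + false_positives + true_negatives
accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)

cost_per_alarm = 5.0        # euros to investigate one alarm (assumed)
saving_per_fraud = 20.0     # euros saved per caught fraud (assumed)
alarms = true_positives + false_positives
net = true_positives * saving_per_fraud - alarms * cost_per_alarm

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"net value = {net:.0f} euros")
```

Under these assumptions the model is more than 99% accurate, yet fewer than one in five alarms is a real fraud, and investigating all alarms costs more than the caught frauds save.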

Deployment

In deployment the results of data mining—and increasingly the data mining techniques themselves—
are put into real use in order to realize some return on investment. The clearest cases of deployment
involve implementing a predictive model in some information system or business process.

The main difference between data mining and other analytics techniques is that data mining focuses
on the automated search for knowledge, patterns, or regularities from data.

Chapter 3
Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute
selection.

Supervised segmentation: how can we segment the population into groups that differ from each
other with respect to some quantity of interest?

• One of the fundamental ideas of data mining: finding or selecting important, informative
variables or “attributes” of the entities described by the data.
  o Information is a quantity that reduces uncertainty about something.
• Finding informative attributes also is the basis for a widely used predictive modeling
technique called tree induction. Tree induction incorporates the idea of supervised
segmentation in an elegant manner, repeatedly selecting informative attributes.
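
One common way to quantify "reduces uncertainty" is entropy-based information gain. The preview text above does not name a specific measure, so treat the choice of entropy as an assumption. The rows, attribute names, and helper functions are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in target entropy after segmenting rows by `attribute`."""
    parent = entropy([r[target] for r in rows])
    segments = {}
    for r in rows:
        segments.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the segments created by the attribute.
    children = sum(len(s) / len(rows) * entropy(s) for s in segments.values())
    return parent - children


rows = [
    {"contract": "monthly", "region": "north", "churn": "yes"},
    {"contract": "monthly", "region": "south", "churn": "yes"},
    {"contract": "yearly", "region": "north", "churn": "no"},
    {"contract": "yearly", "region": "south", "churn": "no"},
]

best = max(["contract", "region"], key=lambda a: information_gain(rows, a, "churn"))
print(best)
```

In this toy data the contract type removes all uncertainty about churn while the region removes none, so a tree induction procedure built on this measure would split on contract type first.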

Supervised data mining can be divided into classification and regression.

Supervised learning is model creation where the model describes a relationship between a set of
selected variables (attributes or features) and a predefined variable called the target variable. The
model estimates the value of the target variable as a function (possibly a probabilistic function) of
the features.

