Summary

Summary- Data Wrangling (6012B0417Y)

Author: Egbert1976

Course
Data Wrangling (6012B0417Y)

Institution
Universiteit van Amsterdam (UvA)

This English summary of the course Data Wrangling (2023) covers the material you need for the exam.

Uploaded on 25 October 2023
Number of pages: 47
Written in 2023/2024
Type: Summary

Data Wrangling
Lecture 1A: Introduction to Data Science
Contents: Course Information, Introduction to Data Science, 6 Steps of Data Science, Business
Understanding, Data Preparation

Active and Engagement Learning: learning by doing.

- Weekly exercises: each week we apply the theory in practice on real-world applications.
- Main project about Data Science.

Why R Software?

- R is a free, open-source tool.
- R is one of the leading tools for Data Science,
Statistics, and Machine Learning.
- R contains up-to-date machine learning and statistical techniques;
new techniques become available in R very quickly.

What is Data Science?

- Data Science uses tools and techniques to turn data
into meaningful business insights.
o Goal: use data to make better business
decisions.
- Data science combines data analysis, statistics,
machine learning, and related methodology to
manage and understand the data deluge associated with the emergence of information
technology.
o A data deluge is a scenario where more data is generated than can be successfully
and efficiently managed or capped.
- Data scientists are tasked with presenting digital information in a way that describes its
practical value in data-driven decision-making.
- However, they don’t typically endeavour (try) to solve specific questions in the way that
business analysts do when seeking out business analytics insights.

Concepts

➔ Data Surfing: Ponder (think about) questions -> Explore potential data.
➔ Data Wrangling: Gather data -> Clean data -> Connect data -> Transform data.
➔ Data Mining: Choose algorithm -> Use algorithm -> Test algorithm -> Refine algorithm.
➔ Data Artistry: Use discovery -> Share discovery -> Present discovery.
o Data Art or data-driven art is an artistic practice that relies on the usage of a
dataset to convey emotions to the audience.

Data Wrangling

- The process of transforming ‘raw’ data into data that can be analysed to generate valid
actionable insights.
- Getting your data into a form that is natural to work with.
- Tidying (sort out) and transforming data.

6 steps of Data Science

- The 6 Steps of Data Science are a general problem-solving strategy for a business or
research unit, organised as an adaptive life cycle.
o An adaptive life cycle involves iterative (repeated) and incremental (step-by-step)
development, allowing for flexibility and adaptability in response to changing
requirements and conditions.
o Adapted from the Cross-Industry Standard Process for Data Mining (CRISP-DM).

1. Business/Research Understanding Phase
o Define project requirements and objectives.
o Translate objectives into a data analytics problem definition.
o Prepare a preliminary strategy to meet objectives.
2. Data Preparation Phase
o Clean and prepare data so it is ready for modelling tools.
o Perform transformation of certain variables, if needed.
3. Exploratory Data Analysis
o Analysing data to summarize their main characteristics, often with visual methods.
o Seeing what the data can tell us beyond the formal modelling or hypothesis testing
task.
4. Modelling
o Select and apply one or more modelling techniques.
o Calibrate model settings to optimize results.
5. Evaluation
o Evaluate one or more models for effectiveness.
o Determine whether defined objectives are achieved.
6. Deployment
o Make use of the models created.
o Simple deployment example: generate a report.
o In business, the customer often carries out the deployment based on your model.

Business Understanding: What is the problem that you are trying to solve?

- In the business understanding stage, you need to:
1. Understand the business process.
2. Define project requirements and objectives.
3. Translate objectives into a data analytics problem definition.
4. Prepare a preliminary (introductory) strategy to meet objectives.

Data preparation

- Clean and prepare data so it is ready for modelling tools.
- Perform transformation of certain variables, if needed.
- Raw data are often unprocessed, incomplete, noisy and may contain:
o Obsolete/redundant fields
o Missing values
o Outliers
o Data in a form not suitable for data analysis
o Values not consistent with policy or common sense

Data preprocessing

- For data analytics purposes, database values must
undergo data cleaning and data transformation.
- Minimize GIGO (Garbage In -> Garbage Out).
o If the garbage going into the model is minimized, then the garbage
results coming out of the model are minimized too.
- Data preparation takes roughly 10%-60% of the effort
of the data analysis process, depending on the dataset.

What types of variables do we have?

- Numerical:
o Continuous, entities get a distinct score.
▪ E.g. temperature, body length
o Discrete, counts.
▪ E.g. number of defects
- Categorical, entities are divided into distinct categories.
o Binary variable, two outcomes
▪ E.g. dead or alive
o Ordinal variable, categories with a natural order
▪ E.g. bad, intermediate, good
o Nominal variable, categories without a natural order
▪ E.g. whether someone is an omnivore, vegetarian or vegan.
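The variable types above map onto R data types roughly as follows. This is a minimal sketch in base R; all variable names and values are hypothetical and invented for illustration.

```r
# Hypothetical example values, invented for illustration only.
temperature <- c(36.6, 37.2, 38.1)                  # numerical, continuous
defects     <- c(0L, 2L, 1L)                        # numerical, discrete (counts)
alive       <- factor(c("alive", "dead", "alive"))  # categorical, binary
quality     <- factor(c("good", "bad", "intermediate"),
                      levels = c("bad", "intermediate", "good"),
                      ordered = TRUE)               # categorical, ordinal
diet        <- factor(c("vegan", "omnivore", "vegetarian"))  # categorical, nominal

# Order comparisons only make sense for the ordinal variable.
good_beats_bad <- quality[1] > quality[2]           # compares "good" with "bad"
```

Declaring the levels explicitly matters for ordinal variables: without the levels argument R would sort the categories alphabetically, which is not the intended order here.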

Outliers

- Outliers are unusual observations that go against the trend of the
remaining data.
- Outliers may represent errors in data entry.
- Outliers may suggest important new science.
- Even if an outlier is a valid data point, certain statistical methods are very sensitive to outliers
and may produce unstable results.
- Often, it is easiest to identify outliers by graphing the data.
Identify Outliers

- Histogram: outliers are visible in the histogram
of a numeric feature's values; they are often the
values in the tails.
- Boxplot: a boxplot shows the distribution
of the feature, and outliers appear as separate
points beyond the whiskers.
- Two-dimensional scatter plots help
identify outliers in more than one
variable.
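Besides inspecting plots, the rule that a standard boxplot uses to draw points beyond the whiskers can be applied directly. The sketch below assumes the common 1.5 x IQR convention, which the notes do not state explicitly; the vector y is hypothetical.

```r
# Hypothetical measurements, invented for illustration only.
y <- c(4.1, 4.5, 5.0, 5.2, 5.6, 5.9, 6.3, 31.8, 58.9, 0)

# Quartiles and interquartile range (assumed 1.5 * IQR convention).
q1  <- quantile(y, 0.25, names = FALSE)
q3  <- quantile(y, 0.75, names = FALSE)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Flag values that fall outside the fences.
is_outlier <- y < lower | y > upper
outliers   <- y[is_outlier]
```

In base R, hist(y) and boxplot(y) draw the corresponding plots for a visual check.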

Handling outliers

- Drop the entire row with the outliers (NOT recommended)

diamonds2 <- filter(diamonds, between(y, 3, 20))  # dplyr: keep only rows with 3 <= y <= 20

- Replace outliers with missing values (recommended)

diamonds2 <- mutate(diamonds, y = ifelse(y < 3 | y > 20, NA, y))  # dplyr: keep all rows, set unusual y to NA

o In this way we treat the outliers as missing values. Note that R represents missing values
as NA (Not Available).
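The difference between the two options can be seen on a small example. This is a base-R sketch of the same idea as the dplyr calls above, using a hypothetical data frame rather than the diamonds data; the thresholds 3 and 20 are taken from the notes.

```r
# Hypothetical mini data set, invented for illustration only.
df <- data.frame(y     = c(4.1, 0, 5.6, 58.9),
                 price = c(326, 5139, 334, 12210))

# Option 1: drop every row whose y falls outside [3, 20].
dropped <- df[df$y >= 3 & df$y <= 20, ]

# Option 2: keep every row, but mark unusual y values as NA.
replaced <- df
replaced$y[replaced$y < 3 | replaced$y > 20] <- NA

rows_dropped  <- nrow(dropped)                   # price values of removed rows are lost
rows_replaced <- nrow(replaced)                  # all price values are still available
mean_y        <- mean(replaced$y, na.rm = TRUE)  # NA values must be skipped explicitly
```

Option 2 keeps the information in the other columns, which is one reason the notes recommend it.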

Handling Missing Data

- Missing values pose problems to data analysis methods.
- Delete records containing missing values?
o Dangerous, as the pattern of missing values may be systematic.
o Valuable information in other fields is lost.
o As much as 80% of the records can be lost if 5% of data values are missing from a data set of
30 variables: if values are missing independently, only 0.95^30 ≈ 21% of records are complete.
- Imputation is a method to fill in the missing values with estimated ones:
o Imputation with mean/median/mode
o Imputation with random values.
o Imputation using prediction models.
- Imputation with mean/median/mode is one of the most frequently used methods. It
consists of replacing the missing data for a given attribute with the mean or median
(quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
o Imputation with the mode is used for categorical attributes.
o Imputation with the mean or median is used for numerical attributes.
- Imputation with random values: replace missing values (NAs) with values randomly drawn
from the underlying distribution.
o Benefit: measures of location and spread remain closer to the original.
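The imputation methods above can be sketched in base R as follows. The data and the helper stat_mode are hypothetical (base R has no built-in function for the statistical mode), and drawing replacements from the observed values is only one possible reading of "the underlying distribution".

```r
# Hypothetical data with missing values (NA), invented for illustration only.
age  <- c(23, 31, NA, 45, 31, NA)
diet <- c("vegan", "omnivore", NA, "omnivore", "vegetarian", "omnivore")

# Hypothetical helper: most frequent non-missing value.
stat_mode <- function(x) {
  counts <- table(x)                 # table() leaves out NA by default
  names(counts)[which.max(counts)]
}

# Mean / median imputation for a numerical attribute.
age_mean   <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)
age_median <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# Mode imputation for a categorical attribute.
diet_mode <- ifelse(is.na(diet), stat_mode(diet), diet)

# Random imputation: draw replacements from the observed values
# (an assumption standing in for "the underlying distribution").
set.seed(1)                          # only to make the sketch reproducible
age_random <- age
age_random[is.na(age)] <- sample(age[!is.na(age)], sum(is.na(age)), replace = TRUE)
```

Mean and median imputation give every missing entry the same value, which shrinks the spread of the variable; random imputation avoids this, in line with the benefit noted above.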

Measure of Centre
