Amsterdam Data Science and Artificial Intelligence
Data Wrangling (6012B0417Y)
Data Wrangling
Lecture 1A: Introduction to Data Science
Contents: Course Information, Introduction to Data Science, 6 Steps of Data Science, Business
Understanding, Data Preparation
Active and engaged learning: learning by doing.
- Weekly exercises: each week we apply the theory in practice on real-world applications.
- Main project about Data Science.
Why R Software?
- R is a free and open-source tool.
- R is one of the leading tools for Data Science, Statistics, and Machine Learning.
- R contains up-to-date machine learning and statistical techniques; new techniques become
available in R very quickly.
What is Data Science?
- Data Science uses tools and techniques to turn data
into meaningful business insights.
o Goal: Use data to make better business decisions.
- Data science combines data analysis, statistics,
machine learning, and related methodology to
manage and understand the data deluge associated with the emergence of information
technology.
o A data deluge is a scenario where more data is generated than can be successfully
and efficiently managed or capped.
- Data scientists are tasked with presenting digital information in a way that describes its
practical value in data-driven decision-making.
- However, they don’t typically endeavour (try) to solve specific questions in the way that
business analysts do when seeking out business analytics insights.
Concepts
➔ Data Surfing: Ponder (think about) questions -> Explore potential data.
➔ Data Wrangling: Gather data -> Clean data -> Connect data -> Transform data.
➔ Data Mining: Choose algorithm -> Use algorithm -> Test algorithm -> Refine algorithm.
➔ Data Artistry: Use discovery -> Share discovery -> Present discovery.
o Data Art or data-driven art is an artistic practice that relies on the usage of a
dataset to convey emotions to the audience.
Data Wrangling
- The process of transforming ‘raw’ data into data that can be analysed to generate valid
actionable insights.
- Getting your data into a form that is natural to work with.
- Tidying (sorting out) and transforming data.
6 steps of Data Science
- The 6 steps of Data Science form a general problem-solving strategy for a business or research
unit, organised as an adaptive life cycle.
o An adaptive life cycle involves iterative (repeated) and incremental (step-by-step)
development, allowing for flexibility and adaptability in response to changing
requirements and conditions.
o Based on the Cross-Industry Standard Process for Data Mining (CRISP-DM).
1. Business/Research Understanding Phase
o Define project requirements and objectives.
o Translate objectives into a data analytics problem definition.
o Prepare a preliminary strategy to meet objectives.
2. Data Preparation Phase
o Clean and prepare data so it is ready for modelling tools.
o Perform transformation of certain variables, if needed.
3. Exploratory Data Analysis
o Analysing data to summarize their main characteristics, often with visual methods.
o Seeing what the data can tell us beyond the formal modelling or hypothesis testing
task.
4. Modelling
o Select and apply one or more modelling techniques.
o Calibrate model settings to optimize results.
5. Evaluation
o Evaluate one or more models for effectiveness.
o Determine whether defined objectives are achieved.
6. Deployment
o Make use of the models created.
o Simple deployment example: generate a report.
o In business, the customer often carries out the deployment based on your model.
Business Understanding: What is the problem that you are trying to solve?
- In the business understanding stage, you need to:
1. Understand the business process.
2. Define project requirements and objectives.
3. Translate objectives into a data analytics problem definition.
4. Prepare a preliminary (introductory) strategy to meet objectives.
Data preparation
- Clean and prepare data so it is ready for modelling tools.
- Perform transformation of certain variables, if needed.
- Raw data are often unprocessed, incomplete and noisy, and may contain:
o Obsolete/redundant fields
o Missing values
o Outliers
o Data in a form not suitable for data analysis
o Values not consistent with policy or common sense
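A few of the problems listed above can be checked with base R before any modelling starts. The block below is only a minimal sketch: the data frame `raw`, its column names and its values are invented for illustration and do not come from the lecture.

```r
# Hypothetical raw data; all names and values are assumptions for this sketch.
raw <- data.frame(
  id     = c(1, 2, 3, 4, 5),
  age    = c(34, NA, 29, -5, 41),                        # NA is missing, -5 is impossible
  income = c("52000", "61000", NA, "48000", "75000"),   # numbers stored as text
  stringsAsFactors = FALSE
)

# Missing values: count the NAs in every column.
missing_per_column <- colSums(is.na(raw))

# Data in a form not suitable for analysis: convert the text column to numbers.
raw$income <- as.numeric(raw$income)

# Values not consistent with common sense: row numbers with a negative age.
impossible_age <- which(raw$age < 0)

print(missing_per_column)
print(impossible_age)
```

Checks like these only flag problems. What to do with the flagged values is a separate decision.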
Data preprocessing
- For data analytics purposes, database values must undergo data cleaning and data
transformation.
- Minimize GIGO (Garbage In -> Garbage Out).
o If the garbage going into the model is minimized, the garbage results coming out of the
model are minimized as well.
- Data preparation takes roughly 10%-60% of the effort in the data analysis process, depending
on the dataset.
What types of variables do we have?
- Numerical:
o Continuous: entities get a distinct score.
▪ E.g. temperature, body length
o Discrete: counts.
▪ E.g. number of defects
- Categorical: entities are divided into distinct categories.
o Binary variable: two outcomes.
▪ E.g. dead or alive
o Ordinal variable: categories with a natural order.
▪ E.g. bad, intermediate, good
o Nominal variable: categories without a natural order.
▪ E.g. whether someone is an omnivore, vegetarian or vegan
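One possible way to represent these variable types in R is sketched below. The variable names and values are hypothetical and were chosen only to mirror the examples in the list above.

```r
# Hypothetical example values; the names are assumptions for this sketch.
temperature <- c(36.6, 37.2, 38.1)                       # continuous numerical
defects     <- c(0L, 2L, 1L)                             # discrete numerical (counts)
status      <- factor(c("alive", "dead", "alive"))       # binary categorical
quality     <- factor(c("good", "bad", "intermediate"),
                      levels = c("bad", "intermediate", "good"),
                      ordered = TRUE)                    # ordinal categorical
diet        <- factor(c("vegan", "omnivore", "vegetarian"))  # nominal categorical

class(temperature)       # "numeric"
class(defects)           # "integer"
nlevels(status)          # 2
is.ordered(quality)      # TRUE
is.ordered(diet)         # FALSE
quality[1] > quality[2]  # TRUE: "good" ranks above "bad"
```

Only an ordered factor supports comparisons such as `>`, which is the practical difference between ordinal and nominal variables.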
Outliers
- Outliers are unusual observations that go against the trend of the remaining data.
- Outliers may represent errors in data entry.
- Outliers may suggest important new science.
- Even if an outlier is a valid data point, certain statistical methods are very sensitive to outliers
and may produce unstable results.
- Often, it is easiest to identify outliers by graphing the data.
Identify Outliers
- Histogram: outliers are visible in the histogram of a numeric feature; they may be the values
in the tails.
- Boxplot: a boxplot represents the distribution of a feature and shows outliers as individual
points beyond the whiskers.
- Two-dimensional scatter plots help identify outliers in more than one variable.
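The three plots above can be drawn with base R graphics. This is a minimal sketch: the vectors `x` and `y` are invented, with two suspicious values (0 and 58.9) planted in `y`. The plots are drawn only in an interactive session. The last step applies the default boxplot rule (1.5 times the spread of the box) without drawing anything.

```r
# Hypothetical measurements; the values 0 and 58.9 in y are planted outliers.
y <- c(5.2, 5.9, 4.8, 6.1, 5.5, 0, 5.7, 58.9, 6.3, 5.0)
x <- c(5.1, 5.8, 4.9, 6.0, 5.6, 0, 5.6, 8.1, 6.2, 5.1)

# Draw the plots only when R is used interactively.
if (interactive()) {
  hist(y)      # outliers appear as isolated bars in the tails
  boxplot(y)   # outliers are drawn as individual points
  plot(x, y)   # outliers lie far away from the main cloud of points
}

# Values that a default boxplot would mark as outliers.
flagged <- boxplot.stats(y)$out
print(flagged)
```

A flagged value is a candidate for inspection. Whether it is an error or a genuine observation still has to be judged from context.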
Handling outliers
- Drop the entire row with the outliers (NOT recommended):
    library(tidyverse)  # loads dplyr (filter, between) and ggplot2 (diamonds data)
    diamonds2 <- filter(diamonds, between(y, 3, 20))  # keep only rows where 3 <= y <= 20
- Replace outliers with missing values (recommended)
o In this way we treat the outliers as missing values. Note: in R, missing values are shown
as NA (Not Available).
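The recommended option, replacing outliers with NA, could look like the base R sketch below. The data frame `measurements` and its values are hypothetical. Only the column name `y` and the range 3 to 20 are taken from the filter() call in the notes.

```r
# Hypothetical data; only the name y and the range 3 to 20 come from the notes.
measurements <- data.frame(id = 1:6, y = c(5.2, 0, 6.1, 58.9, 4.8, 5.7))

# Replace values outside [3, 20] with NA instead of dropping the whole row.
cleaned <- measurements
cleaned$y[cleaned$y < 3 | cleaned$y > 20] <- NA

nrow(cleaned)                  # 6: no rows are lost
sum(is.na(cleaned$y))          # 2: both outliers are now missing values
mean(cleaned$y, na.rm = TRUE)  # summaries must skip the NAs explicitly
```

All other columns of the affected rows remain available for analysis, which is the main advantage over dropping rows.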
Handling Missing Data
- Missing values pose problems to data analysis methods.
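- Delete records containing missing values?
o Dangerous, as the pattern of missing values may be systematic.
o Valuable information in the other fields is lost.
o As much as 80% of the records are lost if 5% of the data values are missing from a data set of
30 variables: if values are missing independently, a record is complete with probability
0.95^30 ≈ 0.21, so about 79% of the records contain at least one missing value.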
- Imputation is a method to fill in the missing values with estimated ones:
o Imputation with mean/median/mode
o Imputation with random values.
o Imputation using prediction models.
- Imputation with Mean/Mode/Median is one of the most frequently used methods. It
consists of replacing the missing data for a given attribute by the mean or median
(quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
o Imputation with Mode is used for the categorical attributes.
o Imputation with Mean and Median is used for the numerical attributes.
- Imputation with random values: replace missing values (NAs) with values randomly taken
from the underlying distribution.
o Benefit: measures of location and spread remain closer to original.
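The imputation methods above can be sketched in base R as follows. The vectors `age` and `diet` are hypothetical. For random imputation, the sketch assumes that the observed values stand in for the underlying distribution. Fitting a theoretical distribution would be another option.

```r
set.seed(1)  # makes the random imputation reproducible

# Hypothetical data with missing values (NA).
age  <- c(23, 35, NA, 41, 29, NA, 35)
diet <- c("vegan", "omnivore", NA, "omnivore", "vegetarian", "omnivore", NA)

# Mean and median imputation for a numerical attribute.
age_mean   <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)
age_median <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# Mode imputation for a categorical attribute. Base R has no function for the
# statistical mode, so the most frequent value is read from table().
diet_mode    <- names(which.max(table(diet)))
diet_imputed <- ifelse(is.na(diet), diet_mode, diet)

# Random imputation: draw replacements from the observed values (assumption:
# the observed values approximate the underlying distribution).
observed   <- age[!is.na(age)]
age_random <- age
age_random[is.na(age)] <- sample(observed, sum(is.na(age)), replace = TRUE)

print(age_mean)
print(diet_imputed)
print(age_random)
```

Mean imputation leaves the mean unchanged but shrinks the spread, because every filled-in value sits exactly at the centre. Random draws avoid this, which matches the benefit noted above.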
Measure of Centre