Data Wrangling
Lecture 1A: Introduction to Data Science
Contents: Course Information, Introduction to Data Science, the 6 Steps of Data Science, Business
Understanding, Data Preparation
Active and Engaged Learning: learning by doing.
- Weekly exercises: each week we apply the theory in practice on real-world applications.
- A main project on Data Science.
Why R Software?
- R is a free and open-source tool.
- R is one of the leading tools for Data Science, Statistics, and Machine Learning.
- R contains state-of-the-art machine learning and statistical techniques; new techniques
become available in R very quickly.
What is Data Science?
- Data Science uses tools and techniques to turn data
into meaningful business insights.
o Goal: Use data to make better business decisions.
- Data science combines data analysis, statistics,
machine learning, and related methodology to
manage and understand the data deluge associated with the emergence of information
technology.
o A data deluge is a situation in which more data is generated than can be successfully
and efficiently managed or processed.
- Data scientists are tasked with presenting digital information in a way that describes its
practical value in data-driven decision-making.
- However, they don’t typically endeavour (try) to solve specific questions in the way that
business analysts do when seeking out business analytics insights.
Concepts
Data surfing: ponder (think about) questions-> Explore potential data.
➔ Data Wrangling: Gather Data-> Clean the Data-> Connect Data-> Transform Data.
➔ Data mining: Choose Algorithm-> use algorithm-> test algorithm-> refine algorithm.
➔ Data Artistry: Use Discovery-> Share Discovery-> Present Discovery.
o Data Art or data-driven art is an artistic practice that relies on the usage of a
dataset to convey emotions to the audience.
Data Wrangling
- The process of transforming ‘raw’ data into data that can be analysed to generate valid
actionable insights.
- Getting your data into a form that is natural to work with.
- Tidying (sort out) and transforming data.
6 Steps of Data Science
- The 6 Steps of Data Science are a general problem-solving strategy for a business/research
unit; the process follows an adaptive life cycle.
o An adaptive life cycle involves iterative (repeated) and incremental (step-by-step)
development, allowing for flexibility and adaptability in response to changing
requirements and conditions.
o Also known as the Cross-Industry Standard Process for Data Mining (CRISP-DM).
1. Business/Research Understanding Phase
o Define project requirements and objectives.
o Translate objectives into a data analytics problem definition.
o Prepare a preliminary strategy to meet objectives.
2. Data Preparation Phase
o Clean and prepare data so it is ready for modelling tools.
o Perform transformation of certain variables, if needed.
3. Exploratory Data Analysis
o Analysing data to summarize their main characteristics, often with visual methods.
o Seeing what the data can tell us beyond the formal modelling or hypothesis testing
task.
4. Modelling
o Select and apply one or more modelling techniques.
o Calibrate model settings to optimize results.
5. Evaluation
o Evaluate one or more models for effectiveness.
o Determine whether defined objectives are achieved.
6. Deployment
o Make use of the models created.
o Simple deployment example: generate a report.
o In business, the customer often carries out the deployment based on your model.
Business Understanding: What is the problem that you are trying to solve?
- In business understanding stage, you need to:
1. Understand the business process.
2. Define project requirements and objectives.
3. Translate objectives into a data analytics problem definition.
4. Prepare a preliminary (initial) strategy to meet objectives.
Data preparation
- Clean and prepare data so it is ready for modelling tools.
- Perform transformation of certain variables, if needed.
- Raw data are often unprocessed, incomplete, noisy and may contain:
o Obsolete/redundant fields
o Missing values
o Outliers
o Data in a form not suitable for data analysis
o Values not consistent with policy or common sense
Data preprocess
- For data analytics purposes, database values must
undergo data cleaning and data transformation.
- Minimize GIGO (Garbage In-> Garbage Out).
o If GIGO is minimized, then the garbage coming out of the model is also minimized.
- Effort for data preparation ranges around 10%-60%
of data analysis process, depending on the dataset.
What type of variables we have?
- Numerical:
o Continuous, entities get a distinct score.
▪ E.g. temperature, body length
o Discrete, counts.
▪ Number of defects
- Categorical, entities are divided into distinct categories.
o Binary variable, two outcomes
▪ Dead or alive
o Ordinal variable
▪ Bad, intermediate, good
o Nominal variable
▪ Whether someone is an omnivore, vegetarian or vegan.
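The variable types above map directly onto R's basic data types; a minimal sketch with made-up illustrative data:

```r
# Illustrative examples only: the data values below are made up.
temperature <- c(36.6, 37.1, 38.2)                 # numerical, continuous
n_defects   <- c(0L, 2L, 1L)                       # numerical, discrete (counts)
alive       <- factor(c("dead", "alive", "alive")) # categorical, binary
quality     <- factor(c("bad", "good", "intermediate"),
                      levels  = c("bad", "intermediate", "good"),
                      ordered = TRUE)              # categorical, ordinal
diet        <- factor(c("omnivore", "vegan", "vegetarian")) # categorical, nominal

is.numeric(temperature)  # TRUE
is.ordered(quality)      # TRUE: the levels have a meaningful order
quality[1] < quality[2]  # TRUE: ordered factors support comparisons
```

In R, categorical variables are stored as factors; setting `ordered = TRUE` tells modelling functions that the levels have a rank.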
Outliers
- Outliers, or unusual values, are observations that go against the trend of the remaining
data.
- Outliers may represent errors in data entry.
- Outliers may suggest important new science.
- Even if an outlier is a valid data point, certain statistical methods are very sensitive to outliers
and may produce unstable results.
- Often, it is easiest to identify outliers by graphing the data.
Identify Outliers
- Histogram: outliers show up in a histogram of numeric feature values; they may appear as
isolated bars in the tails.
- Boxplot: we can observe outliers using a boxplot. The boxplot represents the distribution
of the feature; points beyond the whiskers are potential outliers.
- Two-dimensional scatter plots help
determine outliers in more than one
variable.
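All three plot types above can be produced with base-R graphics; a minimal sketch using simulated data with two planted outliers (variable names are illustrative):

```r
set.seed(42)
y <- c(rnorm(200, mean = 10, sd = 1), 25, 30)  # normal data plus two planted outliers

hist(y, breaks = 30)          # outliers appear as isolated bars in the right tail
boxplot(y)                    # points beyond the whiskers are flagged
out <- boxplot.stats(y)$out   # the values the 1.5*IQR boxplot rule flags

x <- rnorm(202)
plot(x, y)                    # two-dimensional scatter plot of x against y
```

`boxplot.stats()` returns, among other things, the data points beyond the whiskers, which is a convenient numeric counterpart to reading the plot by eye.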
Handling outliers
- Drop the entire rows containing the outliers (NOT recommended):
library(dplyr)   # filter(), between() and mutate() come from dplyr; diamonds is from ggplot2
diamonds2 <- filter(diamonds, between(y, 3, 20))
- Replace outliers with missing values (recommended):
diamonds2 <- mutate(diamonds, y = ifelse(y < 3 | y > 20, NA, y))
o In this way we treat the outliers as missing values. Note that in R, missing values
are shown as NA (Not Available).
Handling Missing Data
- Missing values pose problems to data analysis methods.
- Delete records containing missing values?
o Dangerous, as pattern of missing values may be systematic.
o Valuable information in other fields lost.
o As many as 80% of the records may be lost if only 5% of data values are missing
from a data set with 30 variables.
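The 80% figure follows from simple probability: if each of 30 variables is independently missing with probability 0.05, a record survives only when all 30 values are present. A quick check in R (assuming independence of missingness, which is a simplification):

```r
p_complete <- 0.95^30     # probability that all 30 values of a record are present
p_lost     <- 1 - p_complete
round(p_complete, 3)      # 0.215: only about 21.5% of records are fully complete
round(p_lost, 3)          # 0.785: roughly 80% of records would be dropped
```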
- Imputation is a method that fills in the missing values with estimated ones:
o Imputation with mean/median/mode
o Imputation with random values.
o Imputation using prediction models.
- Imputation with Mean/Mode/Median is one of the most frequently used methods. It
consists of replacing the missing data for a given attribute by the mean or median
(quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
o Imputation with Mode is used for the categorical attributes.
o Imputation with Mean and Median is used for the numerical attributes.
- Imputation with random values: Replace missing values (NAs) with values randomly taken
from underlying distribution.
o Benefit: measures of location and spread remain closer to original.
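The imputation strategies above can be sketched in base R on a toy vector (data and variable names are illustrative):

```r
set.seed(1)
x <- c(4.1, NA, 5.0, 6.2, NA, 5.5)   # numerical attribute with two missing values

# 1. Mean imputation (numerical attribute)
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# 2. Median imputation (numerical attribute; more robust to outliers)
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# 3. Random-value imputation: draw replacements from the observed values,
#    so measures of location and spread stay close to the original.
observed <- x[!is.na(x)]
x_random <- x
x_random[is.na(x_random)] <- sample(observed, sum(is.na(x)), replace = TRUE)

# Mode imputation for a categorical attribute
f        <- c("red", NA, "blue", "red", NA)
mode_val <- names(which.max(table(f)))       # most frequent observed category
f_mode   <- ifelse(is.na(f), mode_val, f)
```

Prediction-model imputation (the third method in the list) would instead fit a model on the complete records and predict the missing entries; it is omitted here for brevity.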
Measure of Centre