Amsterdam Data Science and Artificial Intelligence
Data Wrangling (6012B0417Y)
Data Wrangling
Lecture 1A: Introduction to Data Science
Contents: Course Information, Introduction to Data Science, 6 Steps of Data Science, Business Understanding, Data Preparation
Active and Engaged Learning: learning by doing.
- Weekly exercises: each week we apply the theory in practice on real-world applications.
- Main project about Data Science.
Why R Software?
- R is a free, open-source tool.
- R is one of the leading tools for Data Science, Statistics, and Machine Learning.
- R contains up-to-date machine learning and statistical techniques; new techniques are made available in R very quickly.
What is Data Science?
- Data Science uses tools and techniques to turn data
into meaningful business insights.
o Goal: Use data to make better business decisions.
- Data science combines data analysis, statistics,
machine learning, and related methodology to
manage and understand the data deluge associated with the emergence of information
technology.
o A data deluge is a scenario where more data is generated than can be successfully
and efficiently managed or capped.
- Data scientists are tasked with presenting digital information in a way that describes its
practical value in data-driven decision-making.
- However, they don’t typically endeavour (try) to solve specific questions in the way that
business analysts do when seeking out business analytics insights.
Concepts
Data Surfing: Ponder (think about) questions -> Explore potential data.
➔ Data Wrangling: Gather data -> Clean data -> Connect data -> Transform data.
➔ Data Mining: Choose algorithm -> Use algorithm -> Test algorithm -> Refine algorithm.
➔ Data Artistry: Use discovery -> Share discovery -> Present discovery.
o Data Art or data-driven art is an artistic practice that relies on the usage of a
dataset to convey emotions to the audience.
Data Wrangling
- The process of transforming ‘raw’ data into data that can be analysed to generate valid
actionable insights.
- Getting your data into a form that is natural to work with.
- Tidying (sort out) and transforming data.
6 steps of Data Science
- The 6 steps of Data Science are a general problem-solving strategy for a business or research unit, organised as an adaptive life cycle.
o An adaptive life cycle involves iterative (repeated) and incremental (step-by-step) development, allowing for flexibility and adaptability in response to changing requirements and conditions.
o Based on the Cross-Industry Standard Process for Data Mining (CRISP-DM).
1. Business/Research Understanding Phase
o Define project requirements and objectives.
o Translate objectives into a data analytics problem definition.
o Prepare a preliminary strategy to meet objectives.
2. Data Preparation Phase
o Clean and prepare data so it is ready for modelling tools.
o Perform transformation of certain variables, if needed.
3. Exploratory Data Analysis
o Analysing data to summarize their main characteristics, often with visual methods.
o Seeing what the data can tell us beyond the formal modelling or hypothesis testing
task.
4. Modelling
o Select and apply one or more modelling techniques.
o Calibrate model settings to optimize results.
5. Evaluation
o Evaluate one or more models for effectiveness.
o Determine whether defined objectives are achieved.
6. Deployment
o Make use of the models created.
o Simple deployment example: generate a report.
o In business, the customer often carries out the deployment based on your model.
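The six steps can be walked through on a small scale as in the sketch below. Everything specific in it is an assumption made for illustration and is not part of the lecture: the objective (predicting fuel economy), the built-in mtcars data set, the linear regression model, and the 24/8 train/test split.

```r
# 1. Business understanding: hypothetical objective, predict fuel economy (mpg)
#    from car weight (wt) using R's built-in mtcars data.

# 2. Data preparation: keep the needed columns and drop incomplete rows.
cars <- na.omit(mtcars[, c("mpg", "wt")])

# 3. Exploratory data analysis: summarise and plot.
summary(cars)
plot(cars$wt, cars$mpg)

# 4. Modelling: fit a linear regression on a training subset.
set.seed(123)
train_rows <- sample(nrow(cars), size = 24)
train <- cars[train_rows, ]
test  <- cars[-train_rows, ]
model <- lm(mpg ~ wt, data = train)

# 5. Evaluation: root mean squared error on the held-out rows.
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))

# 6. Deployment: the simplest form is a short report.
cat("Held-out RMSE (mpg):", round(rmse, 2), "\n")
```

In a real project the cycle is adaptive: a poor evaluation result would send you back to data preparation or modelling rather than on to deployment.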
Business Understanding: What is the problem that you are trying to solve?
- In the business understanding stage, you need to:
1. Understand the business process.
2. Define project requirements and objectives.
3. Translate objectives into a data analytics problem definition.
4. Prepare a preliminary (introductory) strategy to meet objectives.
Data preparation
- Clean and prepare data so it is ready for modelling tools.
- Perform transformation of certain variables, if needed.
- Raw data are often unprocessed, incomplete, noisy and may contain:
o Obsolete/redundant fields
o Missing values
o Outliers
o Data in a form not suitable for data analysis
o Values not consistent with policy or common sense
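A first audit of raw data for the problems listed above might look like the following sketch. The data frame `raw` and its columns are hypothetical and exist only to show each check.

```r
# Hypothetical raw customer records showing the problems listed above.
raw <- data.frame(
  id     = c(1, 2, 3, 4),
  age    = c(34, NA, -5, 41),                     # missing value and impossible age
  income = c("52000", "48,000", "61000", "n/a"),  # numbers stored as text
  stringsAsFactors = FALSE
)

colSums(is.na(raw))          # missing values per column
raw[which(raw$age < 0), ]    # values that contradict common sense

# Convert income to numeric: strip thousands separators; unparseable entries
# such as "n/a" become NA (as.numeric would warn about them, hence the wrapper).
raw$income <- suppressWarnings(as.numeric(gsub(",", "", raw$income)))
```

After the conversion, the "n/a" entry is an explicit NA, so it can be handled together with the other missing values instead of hiding inside a text column.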
Data preprocessing
- For data analytics purposes, database values must
undergo data cleaning and data transformation.
- Minimize GIGO (Garbage In -> Garbage Out).
o If the garbage going into the model is minimized, the garbage results coming out of the model are minimized too.
- Data preparation takes roughly 10%-60% of the effort of the data analysis process, depending on the dataset.
What types of variables do we have?
- Numerical:
o Continuous, entities get a distinct score.
▪ E.g. temperature, body length
o Discrete, counts.
▪ Number of defects
- Categorical, entities are divided into distinct categories.
o Binary variable, two outcomes
▪ Dead or alive
o Ordinal variable
▪ Bad, intermediate, good
o Nominal variable
▪ Whether someone is an omnivore, vegetarian or vegan.
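In R, these variable types map onto different storage types. The sketch below uses a hypothetical data frame `toy` whose columns reuse the examples from the list above; the values themselves are made up.

```r
# Hypothetical toy data illustrating each variable type from the notes.
toy <- data.frame(
  temperature = c(36.6, 38.2, 37.1),                          # numerical, continuous
  defects     = c(0L, 2L, 1L),                                # numerical, discrete (counts)
  status      = factor(c("alive", "dead", "alive")),          # categorical, binary
  rating      = factor(c("bad", "good", "intermediate"),
                       levels = c("bad", "intermediate", "good"),
                       ordered = TRUE),                       # categorical, ordinal
  diet        = factor(c("vegan", "omnivore", "vegetarian"))  # categorical, nominal
)

str(toy)               # shows the storage type of every column
toy$rating < "good"    # ordered factors support comparisons between levels
```

Declaring the level order explicitly matters for ordinal variables: without `levels`, R would sort the categories alphabetically.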
Outliers
- Outliers are unusual observations that go against the trend of the remaining data.
- Outliers may represent errors in data entry.
- Outliers may suggest important new science.
- Even if an outlier is a valid data point, certain statistical methods are very sensitive to outliers and may produce unstable results.
- Often, it is easiest to identify outliers by graphing the data.
Identify Outliers
- Histogram: outliers can be seen in a histogram of the numeric feature values; they may be the values in the tails.
- Boxplot: outliers can be observed in a boxplot, which represents the distribution of the feature.
- Two-dimensional scatter plots help to identify outliers in more than one variable.
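The three plots can be produced with base R graphics, as in this minimal sketch. The vectors `x` and `y` are simulated for illustration, with two extreme values planted in `y`; using `boxplot.stats()` to list the flagged values is an addition that is not mentioned in the notes.

```r
# Hypothetical numeric features; y contains two planted extreme values.
set.seed(1)
y <- c(rnorm(100, mean = 10, sd = 1), 25, -4)
x <- c(rnorm(100, mean = 5, sd = 1), 5, 5)

hist(y, breaks = 30)   # outliers appear as isolated bars in the tails
boxplot(y)             # points beyond the whiskers are drawn individually
plot(x, y)             # scatter plot: unusual (x, y) combinations stand out

# boxplot.stats() returns the values that fall beyond the whiskers
# (by default more than 1.5 * IQR away from the box).
flagged <- boxplot.stats(y)$out
flagged
```

Values flagged this way are candidates for inspection only; whether they are errors or genuine observations still has to be judged in context.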
Handling outliers
- Drop the entire row containing the outlier (NOT recommended):
diamonds2 <- filter(diamonds, between(y, 3, 20))  # needs dplyr; diamonds ships with ggplot2
- Replace outliers with missing values (recommended)
o In this way we treat the outliers as missing values. Note: R represents missing values as NA (Not Available).
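The recommended approach can be sketched in base R as follows. The data frame `df` is hypothetical, and the valid range of 3 to 20 is an assumption that mirrors the bounds in the filter() example above.

```r
# Hypothetical measurements; valid values are assumed to lie between 3 and 20.
df <- data.frame(id = 1:6, y = c(5.2, 0, 7.9, 31.8, 4.4, 58.9))

# Keep every row, but turn out-of-range y values into NA.
df$y <- ifelse(df$y < 3 | df$y > 20, NA, df$y)

df
mean(df$y, na.rm = TRUE)   # na.rm = TRUE skips the NA values
```

Unlike dropping rows, this keeps the other columns of each record (here `id`) available for analysis.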
Handling Missing Data
- Missing values pose problems to data analysis methods.
- Delete records containing missing values?
o Dangerous, as the pattern of missing values may be systematic.
o Valuable information in the other fields is lost.
o As much as 80% of the records can be lost if 5% of the data values are missing from a data set of 30 variables: if values are missing independently, a record is complete with probability 0.95^30 ≈ 0.21.
- Imputation is a method to fill in the missing values with estimated ones:
o Imputation with mean/median/mode
o Imputation with random values.
o Imputation using prediction models.
- Imputation with Mean/Mode/Median is one of the most frequently used methods. It
consists of replacing the missing data for a given attribute by the mean or median
(quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
o Imputation with Mode is used for the categorical attributes.
o Imputation with Mean and Median is used for the numerical attributes.
- Imputation with random values: replace missing values (NAs) with values drawn at random from the underlying distribution of the variable.
o Benefit: measures of location and spread remain closer to the original.
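The mean, mode, and random imputation methods above can be sketched in base R. The data frame `df` is hypothetical, and drawing random replacements from the observed values is a simple stand-in, assumed here, for sampling from the underlying distribution.

```r
# Hypothetical data with missing values in one numeric and one categorical column.
df <- data.frame(
  age  = c(23, NA, 31, 27, NA, 35),
  diet = c("vegan", "omnivore", NA, "omnivore", "vegetarian", "omnivore"),
  stringsAsFactors = FALSE
)

# Mean imputation for the numerical attribute (use median() for the median).
age_mean <- df$age
age_mean[is.na(age_mean)] <- mean(df$age, na.rm = TRUE)

# Mode imputation for the categorical attribute: base R has no function for
# the statistical mode, so take the most frequent value from table().
diet_mode <- df$diet
mode_value <- names(which.max(table(df$diet)))
diet_mode[is.na(diet_mode)] <- mode_value

# Random imputation: draw replacements from the observed values.
set.seed(42)
age_random <- df$age
observed <- df$age[!is.na(df$age)]
age_random[is.na(age_random)] <- sample(observed, sum(is.na(df$age)), replace = TRUE)
```

Mean imputation gives every missing entry the same value, which shrinks the spread of the variable; random imputation avoids this, which is the benefit noted above.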
Measure of Centre