Amsterdam Data Science and Artificial Intelligence
Data Wrangling (6012B0417Y)
Data Wrangling
Lecture 1A: Introduction to Data Science
Contents: Course Information, Introduction to Data Science, the 6 Steps of Data Science, Business
Understanding, Data Preparation
Active and Engaged Learning: learning by doing.
- Weekly exercises: each week we apply the theory in practice on real-world applications.
- Main project about Data Science.
Why R Software?
- R is a free and open-source tool.
- R is one of the leading tools for Data Science,
Statistics, and Machine Learning.
- R contains up-to-date machine learning and statistical techniques;
new techniques are made available in R very quickly.
What is Data Science?
- Data Science uses tools and techniques to turn data
into meaningful business insights.
o Goal: Use data to make better business
decisions.
- Data science combines data analysis, statistics,
machine learning, and related methodology to
manage and understand the data deluge associated with the emergence of information
technology.
o A data deluge is a scenario where more data is generated than can be successfully
and efficiently managed or capped.
- Data scientists are tasked with presenting digital information in a way that describes its
practical value in data-driven decision-making.
- However, they don’t typically endeavour (try) to solve specific questions in the way that
business analysts do when seeking out business analytics insights.
Concepts
➔ Data Surfing: Ponder (think about) questions-> Explore potential data.
➔ Data Wrangling: Gather Data-> Clean the Data-> Connect Data-> Transform Data.
➔ Data mining: Choose Algorithm-> use algorithm-> test algorithm-> refine algorithm.
➔ Data Artistry: Use Discovery-> Share Discovery-> Present Discovery.
o Data Art or data-driven art is an artistic practice that relies on the usage of a
dataset to convey emotions to the audience.
Data Wrangling
- The process of transforming ‘raw’ data into data that can be analysed to generate valid
actionable insights.
- Getting your data into a form that is natural to work with.
- Tidying (sort out) and transforming data.
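The Gather -> Clean -> Connect -> Transform chain can be illustrated with a minimal base R sketch. The two tables, their column names, and their values are hypothetical and exist only for this example; real projects would gather data from files or databases instead.

```r
# Gather: hypothetical raw data, two small tables that share a customer id.
orders <- data.frame(
  id     = c(1, 2, 2, 3),
  amount = c("10.5", "20", "20", NA),   # amounts arrived as text
  stringsAsFactors = FALSE
)
customers <- data.frame(
  id   = c(1, 2, 3),
  city = c("Amsterdam", "Utrecht", "Leiden"),
  stringsAsFactors = FALSE
)

# Clean: drop exact duplicate rows and convert the text column to numbers.
orders_clean <- unique(orders)
orders_clean$amount <- as.numeric(orders_clean$amount)

# Connect: join the two tables on the shared key.
combined <- merge(orders_clean, customers, by = "id")

# Transform: derive a new variable from an existing one.
combined$amount_cents <- combined$amount * 100

print(combined)
```

The result has one row per customer, with the missing amount kept as NA rather than silently dropped.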
6 steps of Data Science
- The 6 steps of Data Science form a general problem-solving strategy for a business/research unit,
organised as an adaptive life cycle.
o An adaptive life cycle involves iterative (repeated) and incremental (step-by-step)
development, allowing for flexibility and adaptability in response to changing
requirements and conditions.
o Also known as the Cross-Industry Standard Process for Data Mining (CRISP-DM).
1. Business/Research Understanding Phase
o Define project requirements and objectives.
o Translate objectives into a data analytics problem definition.
o Prepare a preliminary strategy to meet objectives.
2. Data Preparation Phase
o Clean and prepare data so it is ready for modelling tools.
o Perform transformation of certain variables, if needed.
3. Exploratory Data Analysis
o Analysing data to summarize their main characteristics, often with visual methods.
o Seeing what the data can tell us beyond the formal modelling or hypothesis testing
task.
4. Modelling
o Select and apply one or more modelling techniques.
o Calibrate model settings to optimize results.
5. Evaluation
o Evaluate one or more models for effectiveness.
o Determine whether defined objectives are achieved.
6. Deployment
o Make use of the models created.
o Simple deployment example: generate a report.
o In business, the customer often carries out the deployment based on your model.
Business Understanding: What is the problem that you are trying to solve?
- In the business understanding stage, you need to:
1. Understand the business process.
2. Define project requirements and objectives.
3. Translate objectives into a data analytics problem definition.
4. Prepare a preliminary (introductory) strategy to meet objectives.
Data preparation
- Clean and prepare data so it is ready for modelling tools.
- Perform transformation of certain variables, if needed.
- Raw data are often unprocessed, incomplete, noisy and may contain:
o Obsolete/redundant fields
o Missing values
o Outliers
o Data in a form not suitable for data analysis
o Values not consistent with policy or common sense
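Several of the problems listed above can be checked for with a few lines of base R. This is a minimal sketch: the data frame `raw`, its columns, and the rule that an age cannot be negative are assumptions made up for illustration.

```r
# Hypothetical raw data with typical problems.
raw <- data.frame(
  age     = c(34, NA, 29, -5, 41),
  income  = c(52000, 48000, NA, 61000, 58000),
  country = c("NL", "NL", "NL", "NL", "NL")   # same value in every row
)

# Missing values: count the NAs in each column.
n_missing <- colSums(is.na(raw))

# Redundant fields: a column with a single distinct value carries no information.
is_constant   <- sapply(raw, function(col) length(unique(col)) == 1)
constant_cols <- names(raw)[is_constant]

# Values not consistent with common sense: an age cannot be negative.
bad_age_rows <- which(raw$age < 0)

print(n_missing)
print(constant_cols)
print(bad_age_rows)
```

Checks like these only flag candidates; whether a flagged value is an error is still a judgment call about the data.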
Data preprocessing
- For data analytics purposes, database values must
undergo data cleaning and data transformation.
- Minimize GIGO (Garbage In-> Garbage Out).
o If the garbage going into the model is minimized, then the garbage results
coming out of the model are minimized.
- Data preparation takes roughly 10%-60% of the effort
of the data analysis process, depending on the dataset.
What types of variables do we have?
- Numerical:
o Continuous, entities get a distinct score.
▪ E.g. temperature, body length
o Discrete, counts.
▪ E.g. number of defects
- Categorical, entities are divided into distinct categories.
o Binary variable, two outcomes
▪ Dead or alive
o Ordinal variable
▪ Bad, intermediate, good
o Nominal variable
▪ Whether someone is an omnivore, vegetarian or vegan.
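One possible way to represent these variable types in R is sketched below. The example values are invented; the mapping (numeric, integer, factor, ordered factor) is a common convention rather than the only option.

```r
temperature <- c(18.5, 21.2, 19.8)   # continuous -> numeric (double)
defects     <- c(0L, 2L, 1L)         # discrete counts -> integer

status <- factor(c("alive", "dead", "alive"))            # binary
diet   <- factor(c("vegan", "omnivore", "vegetarian"))   # nominal

# Ordinal: the levels have a meaningful order, so declare it explicitly.
quality <- factor(c("good", "bad", "intermediate"),
                  levels  = c("bad", "intermediate", "good"),
                  ordered = TRUE)

# Order comparisons are only meaningful for ordered factors.
below_good <- quality < "good"
print(below_good)
```

Declaring the level order matters because R otherwise sorts factor levels alphabetically, which would place "bad" before "good" before "intermediate".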
Outliers
- Outliers are observations that are unusual and go against the trend of the
remaining data.
- Outliers may represent errors in data entry.
- Outliers may suggest important new science.
- Even if an outlier is a valid data point, certain statistical methods are very sensitive to outliers
and may produce unstable results.
- Often, it is easiest to identify outliers by graphing the data.
Identify Outliers
- Histogram: outliers show up in the histogram
of a numeric feature's values, typically as
isolated values in the tails.
- Boxplot: a boxplot summarises the distribution
of the feature and draws outliers as
individual points beyond the whiskers.
- Two-dimensional scatter plots help
identify outliers in more than one
variable.
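The points a boxplot draws as outliers can also be listed numerically. The sketch below uses a made-up vector `x` and applies the 1.5 x IQR rule, which is the default whisker length in R's `boxplot()`; the plotting calls are left as comments so the example runs without a graphics device.

```r
# Hypothetical measurements with one suspicious value.
x <- c(5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 31.8)

# Graphical checks (uncomment to draw):
# hist(x)      # the outlier appears as an isolated bar in the tail
# boxplot(x)   # the outlier appears as a point beyond the whisker

# Numerical version of the boxplot rule: values further than
# 1.5 * IQR from the quartiles are flagged.
q     <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
flagged <- x[x < q[1] - fence | x > q[2] + fence]

# boxplot.stats() returns the points boxplot() would draw as outliers.
from_boxplot <- boxplot.stats(x)$out

print(flagged)
print(from_boxplot)
```

Both approaches flag the same value here, but they can differ slightly on other data because `boxplot.stats()` uses hinges rather than `quantile()` quartiles.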
Handling outliers
- Drop every row that contains an outlier (NOT recommended):
# needs library(dplyr) and library(ggplot2) for the diamonds data; keeps rows with 3 <= y <= 20
diamonds2 <- filter(diamonds, between(y, 3, 20))
- Replace outliers with missing values (recommended)
o In this way we treat the outliers as missing values. Note: in R, missing values are shown
as NA (Not Available).
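Replacing outliers with NA could look like the following base R sketch. The vector `y` is hypothetical, and the thresholds 3 and 20 are borrowed from the `between(y, 3, 20)` call in the notes.

```r
# Hypothetical measurements; 0 and 58.9 fall outside the plausible range.
y <- c(3.9, 4.1, 0, 5.6, 58.9, 4.4)

# Keep values inside [3, 20], turn everything else into NA.
y_clean <- ifelse(y < 3 | y > 20, NA, y)

# All rows are still present; summaries must skip the NAs explicitly.
y_mean <- mean(y_clean, na.rm = TRUE)

print(y_clean)
print(y_mean)
```

Unlike dropping rows, this keeps the other variables of the affected records available for analysis.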
Handling Missing Data
- Missing values pose problems to data analysis methods.
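- Delete records containing missing values?
o Dangerous, as the pattern of missing values may be systematic.
o Valuable information in the other fields is lost.
o As much as 80% of the records can be lost if 5% of the data values are missing from a data set of
30 variables (if values are missing independently, 1 - 0.95^30 ≈ 0.79 of the records contain at least one missing value).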
- Imputation is a method to fill in the missing values with estimated ones:
o Imputation with mean/median/mode
o Imputation with random values.
o Imputation using prediction models.
- Imputation with mean/median/mode is one of the most frequently used methods. It
consists of replacing the missing data for a given attribute with the mean or median
(quantitative attribute) or the mode (qualitative attribute) of all known values of that variable.
o Imputation with the mode is used for categorical attributes.
o Imputation with the mean or median is used for numerical attributes.
- Imputation with random values: replace missing values (NAs) with values randomly drawn
from the underlying distribution.
o Benefit: measures of location and spread remain closer to original.
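The imputation strategies above can be sketched in base R as follows. The vectors `age` and `diet` are invented for illustration, and drawing replacements from the observed values is only one assumed way to approximate the "underlying distribution".

```r
set.seed(1)  # make the random imputation reproducible

# Hypothetical data with missing values.
age  <- c(23, 35, NA, 41, 29, NA, 35)
diet <- c("vegan", "omnivore", NA, "omnivore", "vegetarian", "omnivore", NA)

# Mean and median imputation for a numerical variable.
age_mean   <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)
age_median <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

# Mode imputation for a categorical variable.
# Base R has no built-in statistical mode, so take the most frequent value.
diet_mode_value <- names(which.max(table(diet)))
diet_mode       <- ifelse(is.na(diet), diet_mode_value, diet)

# Random imputation: draw replacements from the observed values.
observed   <- age[!is.na(age)]
age_random <- age
age_random[is.na(age)] <- sample(observed, sum(is.na(age)), replace = TRUE)

print(age_mean)
print(diet_mode)
print(age_random)
```

Mean imputation fills every gap with the same value, which shrinks the spread of the variable; random imputation avoids this, at the cost of results that depend on the random seed.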
Measure of Centre