A collection of questions from the slides, previous exam questions and questions found online for 'Fundamentals of Data Science'. Used as preparation for oral exam.
Recap: Pre-processing
Q1: What is the importance of pre-processing?
Importance of Pre-processing:
- Data Quality Improvement: Pre-processing ensures the data is clean and free
from errors or inconsistencies (e.g., removing duplicates, handling missing
values).
- Data Consistency: Standardizes data to ensure consistency, making it suitable
for analysis.
- Feature Engineering: Transforms raw data into meaningful features that
enhance the predictive power of models.
- Algorithm Compatibility: Prepares data to meet the requirements of specific
algorithms, such as encoding categorical variables for models that only handle
numerical data.
- Enhanced Performance: Improves the efficiency and accuracy of models by
ensuring that the input data is appropriately formatted and scaled.
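The points above can be sketched with a minimal, plain-Python pre-processing pass (illustrative only; in practice pandas/scikit-learn would be used, and the column names here are hypothetical):

```python
# Toy dataset with a duplicate row and a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": 25, "income": 40000},   # exact duplicate
    {"age": 31, "income": None},    # missing value
    {"age": 47, "income": 90000},
]

# 1. Remove exact duplicates while preserving order.
seen, deduped = set(), []
for r in rows:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Impute the missing income with the mean of the observed values.
observed = [r["income"] for r in deduped if r["income"] is not None]
mean_income = sum(observed) / len(observed)
for r in deduped:
    if r["income"] is None:
        r["income"] = mean_income

print(len(deduped))          # 3
print(deduped[1]["income"])  # 65000.0 (the imputed mean)
```

Which cleaning steps are appropriate always depends on the data and the downstream model, which is exactly the point of Q2 below.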
Q2: True or false? Explain. Pre-processing is a standardized procedure that is
independent of the model that will be used afterwards.
False.
Explanation: Pre-processing is not entirely standardized and often depends on
the specific requirements of the model to be used. Different models have
different requirements; for example:
- Decision Trees: May not require normalization or scaling of features.
- Linear Models and Neural Networks: Often require features to be
normalized or standardized.
- Algorithms Handling Categorical Data: Some models (e.g., tree-based
models) can handle categorical variables directly, while others (e.g.,
linear regression, SVM) require these variables to be encoded (e.g.,
one-hot encoding).
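As a concrete instance of model-dependent pre-processing, here is a plain-Python sketch of z-score standardization, the kind of scaling linear models and neural networks often need but decision trees do not (scikit-learn's StandardScaler does this in practice):

```python
# Hypothetical feature values for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# z-score standardization: subtract the mean, divide by the std deviation.
mean = sum(values) / len(values)                          # 5.0
var = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
std = var ** 0.5                                          # 2.0
scaled = [(v - mean) / std for v in values]

print(scaled[0])  # -1.5
```

A decision tree would split `values` and `scaled` identically, because splits depend only on the ordering of values, which this (monotone) transformation preserves; a linear model's coefficients and gradient-based training, by contrast, are sensitive to the scale.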
Q3: True or false? Explain. One-hot encoding a categorical feature with
originally 3 separate categories results in 3 new columns.
False.
Explanation: When one-hot encoding with a dropped reference category (also
called dummy encoding), the 3 categories result in 2 new columns: n
categories are transformed into n-1 new binary columns to avoid
multicollinearity (the dummy variable trap) in linear models. Each new
column represents a distinct category, with a 1 indicating the presence of
that category and a 0 indicating its absence; the dropped reference
category is implied when all remaining columns are 0. (Full one-hot
encoding without dropping a column would instead yield all 3.)
Q4: When one-hot encoding, what happens to the original categorical feature?
Why?
When one-hot encoding, the original categorical feature is replaced by the new
binary columns.
Reason:
- The original categorical feature is transformed into a set of binary (0 or 1)
columns, each representing a unique category. This transformation allows
algorithms that require numerical input to process the categorical data
effectively.
- Removing the original categorical feature helps prevent redundancy and
multicollinearity (when one predictor variable in a model can be linearly
predicted from the others with a substantial degree of accuracy), which can
negatively affect model performance and interpretability in linear models.
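The answers to Q3 and Q4 can be sketched in plain Python (pandas' `get_dummies` or scikit-learn's `OneHotEncoder` would be used in practice; the column names are hypothetical). The sketch shows both the full n-column encoding and the n-1 "drop one" variant, and that the original categorical column is replaced rather than kept:

```python
# Toy data: one categorical feature with 3 categories.
data = [{"color": "red"}, {"color": "green"}, {"color": "blue"}, {"color": "red"}]
categories = sorted({r["color"] for r in data})  # ['blue', 'green', 'red']

def one_hot(row, cats, drop_first=False):
    """Encode one row; the original 'color' column is not carried over."""
    used = cats[1:] if drop_first else cats      # optionally drop a reference category
    return {f"color_{c}": int(row["color"] == c) for c in used}

full = [one_hot(r, categories) for r in data]
reduced = [one_hot(r, categories, drop_first=True) for r in data]

print(len(full[0]))     # 3 columns for 3 categories (full one-hot)
print(len(reduced[0]))  # 2 columns when one is dropped (dummy encoding)
print(full[0])          # {'color_blue': 0, 'color_green': 0, 'color_red': 1}
```

In the `reduced` encoding, a row of all zeros unambiguously means the dropped reference category (`'blue'` here), which is why no information is lost while the multicollinearity is removed.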
Q5: Campaign Example:
Consider a company that wants to use data science to improve its targeting of
costly, personally targeted advertisements. The company runs a test campaign,
targeting those who are most likely to respond according to its experts. As
the campaign progresses, more and more data arrive on people who make
purchases after having seen the ad versus those who do not. These data can be
used to build models that discriminate between those to whom we should and
should not advertise. Examples can be held aside to evaluate how accurately
the models predict whether consumers will respond to the ad.
When the resulting models are put into production, targeting their full
customer base “in the wild,” the company is surprised that the models do not
work as well as they did in the lab. Why does it not work?
Answer: the models can underperform in production for several reasons:
• Sampling Bias: Training data from the test campaign may not be
representative of the entire customer base.
Solution: Use a more representative sample for training.
• Data Drift: Customer behaviour changes over time, making the model
outdated.
Solution: Continuously update models with new data.
• Overfitting: Models fit too closely to the training data and fail to generalize.
Solution: Apply regularization, cross-validation, and simpler models.
• Feature Mismatch: Features available in the lab might differ from those in
production.
Solution: Ensure consistency in feature availability and quality.
• Environmental Differences: Differences in operational environments
between lab and production.
Solution: Test models in environments that mimic production setups.
• Evaluation Metrics: Metrics used in the lab may not align with business
objectives.
Solution: Align model evaluation metrics with business goals and test
accordingly.
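One of the remedies above, cross-validation, can be sketched in plain Python (scikit-learn's `KFold` is the usual tool): evaluating the model on each held-out fold in turn gives a more honest estimate of out-of-sample performance than a single lab evaluation, which helps catch overfitting before deployment.

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs splitting range(n) into k contiguous folds."""
    # Distribute n samples over k folds as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(len(folds))   # 5 folds
print(folds[0][1])  # [0, 1] -- first held-out fold
```

Note that cross-validation only addresses the overfitting point; sampling bias and data drift require representative, refreshed data, which no resampling of the test-campaign data can substitute for.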