Summary (2024) Current Topics in Data Science and Artificial Intelligence - 2101TEWDAS

This document contains the possible exam questions for the course "Current Topics in Data Science and Artificial Intelligence". The answers were compiled from the lectures, the PowerPoint presentations and additional research, with the aim of providing answers that are as complete as possible.
Lecture 1 - Véronique Van Vlasselaer (Customs fraud detection)
1) What are the steps in an analytics lifecycle? Describe the four main parts, and how they relate
to the customs case, as seen by Véronique Van Vlasselaer.
The analytics lifecycle generally consists of four main parts: Model Development, Model Deployment, Decisioning, and one overarching element, ModelOps.
Develop model => deploy model => monitor model => make decisions based on the model => get results => get feedback => monitor performance => retrain the model => deploy the model
Model Development: (Preparing data, exploring data and developing the model)
This step involves the creation of analytical models using machine learning (ML) algorithms. It
includes data preparation, feature engineering, model training, and validation.
In the customs case, historical data about packages (e.g., consignor, consignee, country of origin,
contents description, compliant/non-compliant) is used to develop models that can predict the
likelihood of a package being non-compliant or suspicious.
The development process emphasizes both data-centric and model-centric approaches. The data-centric approach involves improving data quality and creating meaningful features, while the model-centric approach involves selecting and tuning the best-performing algorithms. In practice, a hybrid of the two is used.
Feature engineering is the process in which a set of meaningful features is derived from the raw data set, improving data quality and machine learning model performance.
Using feature engineering to augment models with business expertise turned out to be the secret ingredient for the success of all the analytical models.
Two kinds are distinguished: statistical feature engineering and business feature engineering.
Statistical feature engineering can be highly automated: the Feature Machine automatically assesses data quality issues and generates new features by performing the appropriate feature transformations to obtain the optimal feature set.
One example is the creation of RFM (Recency, Frequency, Monetary) features, as sketched below.
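A minimal sketch of how such RFM features could be derived with pandas; the column names and the toy data are invented for illustration and are not from the customs case:

import pandas as pd

shipments = pd.DataFrame({
    "consignor": ["A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2024-01-02", "2024-03-10", "2024-02-01",
                            "2024-02-20", "2024-03-15"]),
    "declared_value": [120.0, 80.0, 500.0, 60.0, 45.0],
})
reference_date = shipments["date"].max()

rfm = shipments.groupby("consignor").agg(
    recency_days=("date", lambda d: (reference_date - d.max()).days),  # Recency
    frequency=("date", "count"),                                       # Frequency
    monetary=("declared_value", "sum"),                                # Monetary
)
print(rfm)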
A key challenge is data imbalance: a highly skewed class distribution, because only about 1% of the cases is non-compliant. Re-sampling techniques are therefore used when the data is skewed:
1) Undersampling: randomly select compliant cases and remove them.
2) Oversampling: duplicate the non-compliant cases.
3) Hybrid approach: a combination of under- and oversampling → the method they used.
Even with the over- and undersampling it was not enough, so they also applied the SMOTE technique, which focuses on the minority class and uses its nearest neighbours to create synthetic samples.
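A minimal sketch of such a hybrid re-sampling set-up using the imbalanced-learn library; the toy data and the sampling ratios are illustrative assumptions, not the values used in the customs case:

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Toy data set with roughly 1% non-compliant cases (label 1).
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01],
                           random_state=42)

resample = Pipeline(steps=[
    # SMOTE: create synthetic minority samples from nearest neighbours
    # until the minority class is 10% of the majority class.
    ("smote", SMOTE(sampling_strategy=0.1, random_state=42)),
    # Then randomly remove compliant (majority) cases until the
    # minority class is half the size of the majority class.
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
])
X_res, y_res = resample.fit_resample(X, y)
print(y_res.mean())  # minority share after re-sampling, roughly 1/3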

Model-centric approach: the better the algorithm, the better the results. Her advice was to run multiple auto-tuned algorithms and choose the best one, not the most exotic one. The best algorithm is defined as the combination of analytical accuracy and business relevance and requirements. They use Recall and Precision as statistical fit measures:
1. Recall: the percentage of all truly suspicious packages that the model detects.
2. Precision: the percentage of the packages flagged as suspicious that are truly suspicious.
=> importance of linking evaluation metrics with business impact and requirements; a small example follows below.
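A small illustration of the two metrics on made-up inspection outcomes (1 = suspicious/non-compliant):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # actual inspection outcomes
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # packages flagged by the model

# Recall: 2 of the 3 truly suspicious packages were caught.
print(recall_score(y_true, y_pred))     # 0.67
# Precision: 2 of the 3 flagged packages were truly suspicious.
print(precision_score(y_true, y_pred))  # 0.67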

Legal considerations => interpretability of the model, both global (how do the features contribute to the model overall) and local (how does the model come to a certain decision for an individual observation).
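As a sketch of the global/local distinction, permutation importance (not named in the lecture, used here only as one common global technique) can be contrasted with per-observation explanations; the model and data are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Global interpretability: how much does each feature contribute
# to the model's performance overall?
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)

# Local interpretability asks instead: why did the model produce this
# score for this one observation? Dedicated tools such as SHAP or LIME
# decompose an individual prediction; here we only show the raw score.
print(model.predict_proba(X[:1]))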

Model Deployment:
Once developed, models are deployed into production environments where they can start processing
real-time data.
This step is about deploying, monitoring and updating the model when needed.
It is important to measure and manage the performance of models over time, because as time passes the performance of most models drops and their predictions become less accurate.
This means identifying issues such as data drift (changes in the input data over time) and concept drift (changes in the model outcomes) that could negatively impact model performance. => expected model performance.
Customs authorities track the model's performance by comparing predicted outcomes with actual inspection results, adjusting the models as necessary to maintain accuracy. => actual model performance.
In the customs case this is done through automated monitoring with alerting systems that check the different metrics and raise an alert when issues arise. Additionally, they use champion and challenger models: based on the ongoing model performance, different champions and challengers can be chosen, and based on this the models are re-trained or re-built.
Models and rules are updated frequently.
Decisioning:
The decisioning phase is about integrating the models and combining them with rules to measure the results: rules → decisions → actions → results.
AI models are integrated into the decisioning process.
→ The decision logic consists of analytical insight + business rules + flow logic. This step is crucial for effective fraud detection.
The final step focuses on using the insights generated by the models to make informed decisions.
This involves combining analytical insights with business rules and human expertise to drive
operational decisions.
In customs fraud detection, decisioning integrates predictive model outputs with business rules
(e.g., specific countries of origin triggering automatic inspections) and human expertise to decide
which packages to inspect.
This integration helps optimize workload and improves the efficiency of customs operations by
prioritizing high-risk packages for inspection.
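A toy sketch of such decision logic; the rule, the thresholds and the actions are invented for illustration:

# Business rule input: hypothetical set of high-risk origins.
HIGH_RISK_ORIGINS = {"XX", "YY"}

def decide(package: dict, model_score: float) -> str:
    # Business rule: certain countries of origin trigger automatic inspection.
    if package["country_of_origin"] in HIGH_RISK_ORIGINS:
        return "inspect"
    # Analytical insight + flow logic: the model score drives the rest.
    if model_score >= 0.8:
        return "inspect"
    if model_score >= 0.5:
        return "document check"
    return "release"

print(decide({"country_of_origin": "ZZ"}, 0.65))  # -> document check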

The ModelOps process:
− Model Lifecycle: Continuous monitoring and improvement are essential to ensure that the fraud
detection models remain effective over time. This involves regularly re-evaluating model
performance and making necessary adjustments.
− The analytics lifecycle in the customs case requires continuous monitoring to ensure that the
models remain accurate and relevant, reflecting the dynamic nature of fraudulent activities.
ModelOps is the approach for going through the analytics lifecycle; many initiatives fail because they are not operationalized.
These four steps form a continuous cycle, ensuring that models are not only accurate and effective at
the point of deployment but remain so through regular updates and refinements based on
performance monitoring and feedback from operational use. In the customs case, this lifecycle helps
maintain a high level of vigilance against fraud while streamlining the inspection process, thereby
enhancing overall operational efficiency.

2) Model performance monitoring:
• Indicate where it is situated in the analytics lifecycle model.
• What is measured, and how can it be measured. Make a distinction between post-hoc and
ex-ante evaluation techniques.
• What if the model performance is no longer sufficient?
Analytics lifecycle: Model development => Model deployment => Decisioning
Model performance monitoring is situated in the model deployment stage of the analytics lifecycle.
After deployment the performance of models is monitored to ensure ongoing effectiveness in real-
world scenarios.
Models must be monitored continuously: the predictive power of models drops off as time goes on, they become less predictive and diverge from the true labels (malperformance).
We want to measure model performance over time.
Ex-ante and post-hoc evaluation techniques:
• Ex-ante evaluation techniques are used to assess the expected model performance. At the time of model deployment, the true outcome of the target variable is often unknown. These metrics allow us to evaluate the expected model performance without needing the actual value of the target variable. => data drift, concept drift, FCI
• Post-hoc evaluation techniques are used to assess the actual model performance and require the true value of the target variable to be known. These methods evaluate the model's performance by comparing predictions with actual outcomes. => ROC curve, AUC, lift, Gini, KS statistic; a small example follows below.
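As a small post-hoc example, AUC can only be computed once the true labels are known from actual inspections; all numbers below are made up:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 0]                           # actual outcomes
y_score = [0.10, 0.30, 0.80, 0.35, 0.20, 0.90, 0.40, 0.15]  # model scores

# AUC: the probability that a randomly chosen suspicious package
# receives a higher score than a randomly chosen compliant one.
print(roc_auc_score(y_true, y_score))  # ~0.93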
Data drift (input variable drift):
When the model is deployed into production, it faces real-world data. As the environment changes, that data may differ from the data the model was trained on. Data drift refers to changes over time in the distribution (statistical properties) of the input data, e.g. a change in the volume of packages.
These shifts can point to significant changes in behaviour that are due to changing external factors: economic downturns/upturns, behavioural changes, regulation, etc.
Such a shift in the input data distribution can lead to a decline in the model's performance: a machine learning model can only be expected to perform well on data similar to the data it was trained on.
We want to measure data stability over time. This can be done by calculating the shift between the distribution on which the model was trained and the current data set that is now being fed into the model. => data stability report
The following formula (a deviation index of the same form as the Population Stability Index) can be used to quantify this:

    deviation index = \sum_i (A_i - E_i) \cdot \ln(A_i / E_i)

where A_i is the actual (current) proportion of observations in bin i of the variable and E_i is the expected (training) proportion.
This formula captures the divergence between the actual distribution of data and the expected
(original) distribution.
A low deviation index indicates that the distribution has remained stable, while a higher deviation
index suggests that the distribution has changed. If the training data set and the current data set
have identical distributions for a variable, the variable's deviation index is equal to 0.
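A minimal sketch of this computation, assuming the binned deviation-index form given above; the bin count and the data are illustrative:

import numpy as np

def deviation_index(expected, actual, bins=10):
    # Bin both samples using edges derived from the training data.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    eps = 1e-6  # avoids division by zero / log(0) in empty bins
    e = e_counts / e_counts.sum() + eps
    a = a_counts / a_counts.sum() + eps
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)                         # training data
print(deviation_index(train, rng.normal(0.0, 1.0, 10_000)))  # ~0: stable
print(deviation_index(train, rng.normal(0.5, 1.0, 10_000)))  # > 0: drift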
