Machine Learning (Data Mining) - Summary (slides and textbook)
Full Summary of Chapters and Lecture Slides Data Science for Business
Written for: Technische Universiteit Eindhoven (TUE), Industrial Engineering, course 1BVK00
Seller: julidekok
1BVK00: Business Analytics & Decision Support
Lecture 1: Data analytical thinking (Chapter 1&2)
Types of decisions
• Strategic: unstructured, one-time (e.g., employee levels, industry trends, rebranding)
• Tactical: semi-structured (e.g., reporting, forecasts, pricing, profitability)
• Operational: structured, recurrent (e.g., scheduling, order processing)
Data science: an interdisciplinary field that uses a variety of techniques to create value by
extracting knowledge from data
• Extracting useful/valuable knowledge to solve business problems in a systematic way,
following well-defined stages
- requires good understanding of application domain
- considers ethics, business models, human behaviour
• CRISP-DM methodology: Cross Industry Standard Process for Data Mining
- Dependence on context
- Finding informative (statistical) attributes
- Generalizing beyond the available data
1. Business Understanding:
- Business objectives
- Success criteria (KPI)
- Project plan
- Deliverables
2. Data Understanding:
- Initial data collection
- Data description
- Data exploration
3. Data Preparation:
- Data cleaning
- Sampling
- Normalization
- Feature selection
4. Modeling:
- Select modeling techniques
- Build/train model
- Prediction
5. Evaluation:
- Model validation
- Performance metrics
- Visualization
- Review results
6. Deployment:
- Model in production
Data mining tasks
• Classification: attempts to determine which discrete category an example belongs to.
• Regression: attempts to estimate or predict, for each individual, the numerical value of
some variable for that individual.
• Clustering: attempts to group individuals in a population together by their similarity, but
not driven by any specific purpose.
• Similarity matching: attempts to identify similar individuals based on data known about
them.
• Co-occurrence grouping: attempts to find associations between entities based on
transactions involving them.
• Profiling: attempts to characterize the typical behavior of an individual, group, or
population.
• Link prediction: attempts to predict connections between data items, usually by
suggesting that a link should exist, and possibly also estimating the strength of the link.
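As an illustration of one of the tasks above, similarity matching can be sketched with a simple distance measure. This is a minimal sketch, not the method from the slides: the choice of Euclidean distance, the customer names, and the feature values are all assumptions made for the example.

```python
import math

# Hypothetical customers described by (age, purchases per year); values are invented.
customers = {
    "anna": (25, 12),
    "bram": (27, 10),
    "cees": (60, 2),
}

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(name, data):
    """Return the other individual whose features are closest to those of `name`."""
    others = (other for other in data if other != name)
    return min(others, key=lambda other: euclidean(data[name], data[other]))

print(most_similar("anna", customers))  # bram
```

The result depends on the units of the features, so in practice the features would usually be brought onto the same scale first.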
Lecture 2: Business Problems & Data Science Solutions (Chapter 2&3&4)
• Unsupervised learning: there is no specific target
• Supervised learning: there exists a specific target
How to detect data quality issues?
- Visualization: Plot all the values of each feature, or inspect a random sample, to
check that they look right.
- Outlier analysis: Check whether extreme values could be human error, e.g., a
300-year-old person in the “age” feature.
- Validation code: Write code that checks whether the data is valid. E.g., for
uniqueness, check that the number of values equals the number of unique
values.
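The uniqueness and outlier checks described above could look roughly like this. It is a minimal sketch: the function names, the example records, and the plausible age range of 0 to 120 are assumptions for illustration only.

```python
def has_unique_values(values):
    """Uniqueness check: the list is as long as its collection of distinct values."""
    return len(values) == len(set(values))

def out_of_range(values, low, high):
    """Return the values that fall outside the plausible range [low, high]."""
    return [v for v in values if not (low <= v <= high)]

# Invented example records
customer_ids = [101, 102, 103, 103]
ages = [34, 51, 300, 27]

print(has_unique_values(customer_ids))  # False: 103 occurs twice
print(out_of_range(ages, 0, 120))       # [300]
```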
Major tasks for preparing a good dataset:
• Dealing with missing data
1. Ignore records (use only cases with all values)
- Not effective when the percentage of missing values per attribute varies
considerably as it can lead to insufficient and/or biased sample sizes
2. Ignore attributes with missing values
- Use only features (attributes) with all values (may leave out important features)
3. Use a global constant to fill in the missing value
- e.g., “unknown” (may create a new class!)
4. Use the attribute mean to fill in the missing value
5. Use the attribute median or mode to fill in the missing value
6. Many other techniques.
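Options 1 and 3 to 5 for missing data can be sketched in plain Python. This is only an illustration: missing values are assumed to be represented by `None`, and the function names and sample data are hypothetical.

```python
from statistics import mean, median, mode

def drop_incomplete(records):
    """Option 1: keep only records (dicts) without any missing (None) value."""
    return [r for r in records if None not in r.values()]

def fill_missing(values, strategy="mean", constant="unknown"):
    """Options 3-5: replace None by a constant or by a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]      # invented numeric attribute
cities = ["Eindhoven", None, "Eindhoven", "Utrecht"]  # invented categorical attribute

print(fill_missing(ages, "mean"))
print(fill_missing(cities, "mode"))
print(fill_missing(cities, "constant"))
```

The mean and median only make sense for numeric attributes; for categorical attributes the mode or a constant such as “unknown” is the usual choice.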
• Handling categorical data
- Represented as strings or categories, with a finite number of possible values
- Ordinal Data: The categories have an inherent order
- Nominal Data: The categories do not have an inherent order
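The difference between ordinal and nominal data matters when categories are converted to numbers. The notes do not name an encoding technique; the sketch below assumes two common choices, integer ranks for ordinal data and one-hot (0/1 indicator) vectors for nominal data, with hypothetical function names and example values.

```python
def ordinal_encode(values, order):
    """Map ordered categories to integers that preserve the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

def one_hot_encode(values):
    """Map unordered categories to 0/1 indicator vectors, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

sizes = ["M", "S", "L"]  # ordinal: S < M < L
print(ordinal_encode(sizes, order=["S", "M", "L"]))  # [1, 0, 2]

colors = ["red", "blue", "red"]  # nominal: no inherent order
print(one_hot_encode(colors))  # (['blue', 'red'], [[0, 1], [1, 0], [0, 1]])
```

Integer ranks would impose an artificial order on nominal categories, which is why a separate indicator column per category is used there.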
• Bringing features onto the same scale
- Feature scaling is a crucial step in data preprocessing
- Feature scaling is useful when feature values vary widely in magnitude, units and
range (e.g., age, salary, weight)
- Gradient descent and distance-based methods behave much better if features are on
the same scale
- Tree based methods (e.g., Decision tree, Random forest) are invariant to feature
scaling
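Two common ways to bring features onto the same scale are min-max normalization and standardization (z-scores). The notes do not say which one the course uses, so the sketch below is an assumption; the function names and sample values are invented.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1].

    Assumes the values are not all equal (otherwise high - low is zero).
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation.

    Assumes the values are not all equal (otherwise sigma is zero).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [20, 30, 40]            # years
salaries = [2000, 3000, 4000]  # currency units per month

print(min_max_scale(ages))      # [0.0, 0.5, 1.0]
print(min_max_scale(salaries))  # [0.0, 0.5, 1.0]
print(standardize(ages))
```

After scaling, age and salary contribute comparably to a distance computation, even though their raw magnitudes differ by a factor of 100.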