Summary data science skills
Contents
Course 1: Introduction
1.1 Python basics
1.2 Python lists
1.3 Functions and packages
1.4 Numpy (Numeric Python)
Course 2: Intermediate python
2.1 Matplotlib
2.2 Dictionaries & pandas
2.3 Logic, Control Flow and Filtering
2.4 Loops
2.5 Case study: hacker statistics
2.5 Summary
Course 3: DataFrames
3.1 Transforming DataFrames
3.2 Aggregating DataFrames; Summary statistics
3.3 Slicing and Indexing DataFrames
3.4 Creating and Visualizing DataFrames
Course 4: Supply Chain Analytics in Python
4.1 Basics of supply chain optimization and PuLP
4.2 Modeling in PuLP
4.3 Solve and evaluate model
4.4 Sensitivity and simulation testing of model
Course 5: Cleaning Data in Python
5.1 Common data problems
5.2 Text and categorical data problems
5.3 Advanced data problems
5.4 Record linkage
Course 6: Cluster analysis
6.1 Introduction to clustering
6.2 Hierarchical Clustering
6.3 K-Means clustering
6.4 Clustering in the real world
Course 7: Machine Learning with scikit-learn (model testing)
7.1 Classification
7.2 Regression
7.3 Fine-tuning your model
7.4 Preprocessing and pipelines
Course 8: Linear classifiers
8.1 Applying logistic regression and SVM
8.2 Loss functions
8.3 Logistic regression
8.4 Support Vector Machines (SVMs in detail)
Course 1: Introduction
1.1 Python basics
IPython shell = interactive: type a command and see the result immediately
Python script = text file (.py) > use print() to generate output
Use a # to add comments in a Python script
Calculator
Variables and types
• Variables: named piece of memory that can store a value.
- Syntax: name = value
Usage:
- Compute an expression's result,
- Store that result into a variable,
- And use that variable later in the program.
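• Types: check the type of a variable with type(variable)
- Float (float): decimal number
- Integer (int): whole number
- String (str): text, written in single or double quotes
- Boolean (bool): True/False
> Operators behave differently for different types (e.g. + adds numbers but concatenates strings).
> When working with different types -> convert if necessary before using operators (e.g. with str(), int(), float()).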
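The points above can be sketched as follows. This is only an illustration: the variable names and values (height, weight, name, is_adult) are assumptions and do not come from the course material.

```python
# Illustrative values; the names and numbers are assumptions for this sketch.
height = 1.79       # float
weight = 68         # int
name = "me"         # str
is_adult = True     # bool

# Compute an expression, store the result in a variable, reuse it later.
bmi = weight / height ** 2
print(bmi)

# type() reports the type of the value a variable holds.
print(type(height), type(weight), type(name), type(is_adult))

# The same operator behaves differently depending on the type.
total = 2 + 3           # integer addition
joined = "ab" + "cd"    # string concatenation

# Mixing types requires an explicit conversion first.
message = "My height is " + str(height) + " m"
count = int("42") + 1
print(message, count)
```

Without the str() call, the line building message would raise a TypeError, because + is not defined between a string and a float.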
1.2 Python lists
Lists: store multiple values
• Lists: used for storing small amounts of one-dimensional data; a single list can contain different types.
- But lists can't be used directly with element-wise arithmetic (matrix) operators (+, -, *, /, ...).
- If you need efficient arrays with arithmetic and better multidimensional tools, use NumPy arrays (see 1.4).
• Sublists: a list can contain other lists (sublists).
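A short sketch of why lists do not support element-wise arithmetic directly. The values and variable names are illustrative assumptions; it uses plain Python only, with a list comprehension standing in for the array arithmetic that NumPy would provide.

```python
# Illustrative heights in metres (assumed values).
heights = [1.73, 1.68, 1.71, 1.89]

# + and * act on the list as a whole, not on each element.
repeated = heights * 2        # repeats the list: 8 elements
combined = heights + [1.79]   # concatenates: 5 elements

# Element-wise arithmetic needs an explicit loop or comprehension.
in_cm = [h * 100 for h in heights]

# A list can mix types and can contain sublists.
fam_nested = [["liz", 1.73], ["emma", 1.68]]

print(len(repeated), len(combined), in_cm, fam_nested[1][0])
```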
Subsetting lists (accessing information in a list; indexes)
• Element: a value in the list. 1.68 is the fourth element.
• Index: the position of an element in the list; it starts at 0. 1.68 has index 3.
> To select an element using indexing: fam[3] gives 1.68
> Negative indexes count from the end: fam[-1] gives 1.89
• Slicing: select multiple elements from a list, creating a new list
Example: fam[3:5] returns [1.68, 'mom'] (indexes 3 and 4)
> [start:end] -> start is included, end is excluded!
> [:4] returns indexes 0, 1, 2 and 3 (elements 1, 2, 3, 4)
> [5:] returns indexes 5, 6, 7 (elements 6, 7, 8)
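The indexing and slicing rules above can be tried on a concrete list. The notes do not show the full fam list, so the one below is an assumed reconstruction: "emma", 1.68, "mom", "dad" and 1.89 are placed at the positions the notes mention, while "liz", 1.73 and 1.71 are placeholder values.

```python
# Assumed reconstruction of the fam list (some values are placeholders).
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

fourth = fam[3]       # index 3 is the fourth element: 1.68
last = fam[-1]        # negative index counts from the end: 1.89

middle = fam[3:5]     # start included, end excluded: [1.68, "mom"]
first_four = fam[:4]  # indexes 0, 1, 2, 3
tail = fam[5:]        # indexes 5, 6, 7

print(fourth, last, middle, first_four, tail)
```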
Subsetting lists of lists
x = [["a", "b", "c"],
["d", "e", "f"],
["g", "h", "i"]]
x[row][column]
x[2][0] Returns: 'g' (sublist 2, index 0)
x[2][:2] Returns: ['g', 'h'] (sublist 2, indexes 0 and 1)
Manipulating lists (updating lists with commands)
• Changing the elements in a list (e.g. change, add, remove elements)
1. Change: fam[7] = 1.86 Changes the height of dad
2. Change slice: fam[0:2] = ["Lisa", 1.74] Changes the elements at indexes 0 and 1
3. Add/extend: fam + ["me", 1.79] Returns a new list with "me" and 1.79 added at the end; assign the result to a variable to keep it
4. Remove: del(fam[2]) Removes "emma" from the list
> Watch out: the indexes of the elements after the removed one have now changed!
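The four operations above, applied in order to one list. The starting fam list is an assumed reconstruction (the notes do not show it in full; "liz", 1.73 and 1.71 are placeholders), and fam_ext is a hypothetical name for the extended list.

```python
# Assumed starting list; some values are placeholders.
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

# 1. Change a single element: the height of dad.
fam[7] = 1.86

# 2. Change a slice: indexes 0 and 1.
fam[0:2] = ["Lisa", 1.74]

# 3. Add elements: + returns a new list, fam itself is unchanged.
fam_ext = fam + ["me", 1.79]

# 4. Remove an element: everything after index 2 shifts one place left.
del fam[2]

print(fam)      # "emma" is gone, 1.68 now sits at index 2
print(fam_ext)  # built before the deletion, so it still contains "emma"
```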
How lists work
> After y = x, x and y refer to the same list, so a change made through y is also visible through x.
> Solution: create y as a new list, e.g. y = list(x) or y = x[:].
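A minimal sketch of this behaviour. The list contents and the names x2 and y2 are illustrative assumptions.

```python
# Plain assignment: y becomes a second name for the same list.
x = ["a", "b", "c"]
y = x
y[1] = "z"           # changes the one shared list
same_object = x is y

# Explicit copy: y2 is a new list with the same elements.
x2 = ["a", "b", "c"]
y2 = list(x2)        # x2[:] works as well
y2[1] = "z"          # only the copy changes

print(x, same_object)
print(x2, y2)
```

Note that list(x) and x[:] make a shallow copy: sublists inside the list would still be shared between the original and the copy.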