Introduction to Data Science (Week 1, Video 1)
What is Data Science?
= Is a “concept to unify statistics, data analysis and their related methods” in order to “understand and
analyze actual phenomena” with data.
What makes a Data Scientist?
= Data scientists use their data and analytical abilities to find and interpret rich data sources; manage large
amounts of data (…); create visualizations to aid in understanding data; build mathematical models
using the data; and present and communicate the data insights/findings.
A lot of Related Fields
Artificial Intelligence; focuses on intelligent behavior (mimicking humans). More interested in
doing better than humans.
Machine Learning; focuses on certain learning objectives we want to achieve by programming different
functions and algorithms.
Data Mining; mining patterns from data, e.g. VR/sensory or medical data as input.
Information Retrieval; what happens when you type something into Google (users are interested in
something and do a search online – Siri, for example).
Natural Language Processing; deals with interpreting language cleverly. Some of these
systems capture information better than we humans do.
Computer Vision; focuses on the vision system we humans have: processing images, getting
information from them, classifying objects within images.
Audio Signal Processing; deals with audio, speech, music.
Cognitive Sciences; deals with the brain specifically, processes of the brain (too broad).
Intelligent Games; where the agents in a game behave intelligently (with machine learning, for example).
Agents (Biology); simulating agents (entities, animals) in games: can we make sense of their
behavior?
→ Know the minor differences!
ONE COMMONALITY: DATA-DRIVEN SCIENCE (ML/DM)
What is Data?
Example, clouds: what data does the weather have? (A table with certain attributes.)
What you can see is the outlook. What is the temperature, is it windy, can I play outside? YES/NO
• We want to use this data to predict something, i.e. make a classification.
• Sunny or not: this kid doesn’t play outside when it’s sunny.
• Windy conditions are not really a good sign for play either.
• There is not enough data to make a good interpretation.
Convert the data into features
• Convert outlook into numbers.
o 1 = Sunny
o 0 = Cloudy
o 2 = Rainy
• Do the same for wind and temperature, where there is a mapping for values.
o 1 = yes
o 0 = no
o Binary representation, called attributes
• Features are the same as attributes!
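The conversion above can be sketched in Python. The mappings follow the notes (1 = sunny, 0 = cloudy, 2 = rainy; 1 = yes, 0 = no); the example instances themselves are hypothetical.

```python
# Encode the categorical weather attributes as numbers, using the mappings
# from the notes: outlook (1 = sunny, 0 = cloudy, 2 = rainy), windy (1 = yes, 0 = no).
outlook_map = {"sunny": 1, "cloudy": 0, "rainy": 2}
windy_map = {"yes": 1, "no": 0}

# Two hypothetical instances of raw data.
instances = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]

# Convert each instance into its numeric feature representation.
encoded = [
    {"outlook": outlook_map[x["outlook"]], "windy": windy_map[x["windy"]]}
    for x in instances
]
print(encoded)  # [{'outlook': 1, 'windy': 0}, {'outlook': 2, 'windy': 1}]
```

The same idea extends to any discrete attribute: pick a fixed mapping from category labels to numbers and apply it consistently to every instance.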
Other measurements could be the temperature in degrees, how the temperature feels, the amount of rainfall, and the
probability of thunder.
• The scale differs per measurement, since we mix degrees, percentages, km/h, etc. (the units are different).
Another measurement could be with image data / combination with other data sources (photo of
clouds over map)
• Might combine it with other information such as ticket sales for a theme park.
This is INTERPRETING DATA!
Back to our data (of playing outside)
Can you come up with some rules for playing outside?
Conditions (Rules for prediction):
• If it’s sunny & hot → kid does not want to play outside.
• If it’s windy → kid does not want to play outside.
• One rule left: if it’s not windy and not hot → kid wants to play.
We want to predict our target P L A Y given the features we have available.
FORMALLY!
• We have our data: X (with features: outlook, temp,
windy)
o Features can be continuous and discrete
o Continuous features: are real valued and can
be within some range.
o Discrete features: finite values and usually
associated with some label of category.
• Our data consists of smaller instances; ‘some instance’ is
written as: x.
• If we want to specifically point at a particular instance
(say our first row), we write: x1. We can see our model
as a function f, that when given any instance x, gives us
a prediction ŷ.
• The application of the model to some instance in our
data can be written as f(x).
• Our hope is that ŷ is the same as our target: y.
Quick Recap of Example
• Features: X (outlook, temp, windy)
• Targets: Y (play)
• Some instance: x
• Some target: y
• First instance (row): x1 (sunny, hot, no)
• First target: y1 (no)
• Model: if it’s not windy and not hot → play (f)
• Predictions by f(x): ŷ
• Prediction for f(x1): ŷ1 (no)
Predictive Model (OR ALGORITHM)
def play_predictor(data):
    if data["windy"] == "no" and data["temp"] != "hot":
        return "play"
    else:
        return "no play"
It’s sunny, mild, and windy…should I play? Realistic?
It will return ‘no play’: the rule only predicts ‘play’ when it is NOT windy and NOT hot, and here it IS windy (even though the temperature is mild, not hot).
How do we know if our model performs well?
• Correct evaluation is incredibly important in Data Mining.
• We came up with some rules, but how do we know they generalize; that is, do the rules we learned apply with
the same success rate to data where we don’t know what the target is?
Results of our model
• 5/6 correct, so our model has 83.3% accuracy.
• Did we cover all conditions?
• What if we are presented with new conditions?
• Rules are probably too strict.
• Besides the training data we determined our rules from, we also need test data (unseen by us) to
evaluate.
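The 5/6 = 83.3% result above can be sketched as a simple accuracy computation. The label lists below are hypothetical, chosen only so that 5 of the 6 predictions match the targets:

```python
# Hypothetical targets (y) and model predictions (ŷ) for six instances.
y_true = ["no", "no", "yes", "yes", "no", "yes"]
y_pred = ["no", "no", "yes", "yes", "no", "no"]

# Accuracy = fraction of predictions that equal the target.
correct = sum(yt == yp for yt, yp in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"{accuracy:.1%}")  # 83.3%
```

Note that this number is computed on the data we built the rules from; as the notes say, the honest version of this check uses unseen test data.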
Explanation of unseen data: REALISTIC USE CASE
PREDICTING HOUSING PRICES (great example of data mining)
• Would you be able to determine the price of a house? → You need expert knowledge.
• Many observations are required to gain experience (a mental representation of what makes a house price
higher, for example).
• Features to predict the price of a house?
o Number of bedrooms
o Big garden
o Good neighborhood
HOW TO EVALUATE?
• Previously we had a clear binary (yes/no) prediction.
• Say we had more classes: we would still be predicting a nominal target (where order does not matter),
which is different from a numeric target, where you really want to predict a price.
• With a numeric target we can’t say “we got … out of … correct”, so we can’t use accuracy.
• We are more likely interested in how far our prediction was off from the actual value: this is error.
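A minimal sketch of such an error measure, the mean absolute error, which says how far off the predictions were on average. The house prices below are made up for illustration:

```python
# Hypothetical actual and predicted house prices.
y_true = [250_000, 310_000, 180_000]
y_pred = [240_000, 330_000, 200_000]

# Mean absolute error: average distance between prediction and actual value.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(mae, 2))  # 16666.67
```

Unlike accuracy, this stays meaningful when a prediction is close but not exactly right.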
Complex information
• How would location affect price?
• How would pollution affect price?
• How about the good location but high pollution?
• Do you know how much of either would affect the price?
• Would one be able to easily craft a successful ruleset?
LEARNING TO PREDICT (some problems are very hard to solve for humans)
• Hand-made rules are not flexible
• Given more instances/observations, the patterns will become more complex, thus requiring better (more
complex) rules.
• Too much data becomes impossible to manually analyze.
• If done automatically, little expert knowledge is required; mostly data.
• Models can give information regarding underlying patterns and feature importances.
o If many rules mention location as a first condition to look at, that must be an important
feature.
You need good intuitions, domain expertise and get to know your data well (not just a bunch of
algorithms and you are “done”).
EXTRA MATERIAL
Quick discussion of:
• PC hardware and relation to data and algorithms
• Programming languages and their relation to above
This is not computer science, so why do I need to know this?
• Algorithm choices often depend on hardware limitations.
• Some model families specifically deal with shortage of computation power.
• Different data types often relate to storage and processing.
• Certain terms are widespread throughout this course.
PC HARDWARE
Power supply (left corner)
CPU (processor)
HDD (storage, disks)
RAM (memory)
Motherboard (connects all the components)
HARD DRIVE (HDD/SSD)
Place where all your stuff is stored.
• Stores your files.
• HDDs are larger (store more data, 1–5 TB) but slower (at reading/writing) and fragile.
• SSDs are smaller (up to 1 TB), faster, more robust, but more expensive.
• Most modern laptops come with an SSD
• For computations, algorithms/models read a particular set of data from your disks into memory.
MEMORY CHIP (RAM)
• Very fast reading/writing, but even more limited in space (8–16 GB, up to 256 GB), very expensive.
• Algorithms can quickly access and manipulate data that is in memory.
• If memory limit is exceeded, computers usually freeze/processes slow down.
• Computations done on data in memory are commonly handled by the CPU.
PROCESSOR (CPU)
• Does the computational work of a computer.
• Can have multiple computation cores (dual core, quad core) to run operations in parallel (i.e.
simultaneously), which speeds up processes.
simultaneously) which speeds up processes.
• The more expensive the CPU, the faster it does similar computations. The more cores, the faster it
runs parallel computations.
GRAPHICS CARD (GPU, special CPU)
• Some computations can be done on a GPU rather than the CPU.
• Commonly used for processing images or other visual content. Popular for video games.
• For ordinary systems, GPU is usually embedded in the CPU.
• GPUs are very fast at ‘matrix operations’ and have therefore been popularized for Deep Learning
research (explained in future lectures).
• Has its own RAM (and therefore limitations).
Programming Languages
Python (this course): almost reads like English (high-level). Languages like C++ (lower-level) or Prolog are much harder to read.
Representing Data (Week 1, Video 2)
Practical Lecture
LAST WEEK
• Data
• Features
• Algorithms
THIS WEEK
• Data
How do we get Data?
• Pre-mades: data sources that have already been compiled by people, with a clear prediction task; ideal to
work with as a starting data scientist.
o Kaggle, UCI, Snap
• Dumps: big dumps of data
o IMDB, Reddit, MovieLens
• Scientific repositories: are always attached to some paper/research
o Dataverse
• (Web) APIs: common interfaces
o Twitter, Reddit
• Web scraping: where you do it yourself and select the fields you need. Becomes messy if you have to
use different websites, because every website has its own structure.
• At industry-level: databases (their own).
File Formats (3 main formats)
• CSV, comma separated values → flat data structure (table format), named columns and
rows. Separators and quotes.
• JSON → hierarchical (nested), lists and key/values. Widely used for APIs and document-based
databases (NOT FLAT)
• XML <data>, <movie id> → hierarchical, tags to name items. Very common standard; easy to
evaluate if according to some predefined structure.
• See Examples from Lecture Videos
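A sketch of the same flat table in the first two formats, read with Python's standard library; the weather rows come from the running example, and the field values are illustrative:

```python
import csv
import io
import json

# The same small weather table, once as CSV text and once as JSON text.
csv_text = "outlook,temp,windy,play\nsunny,hot,no,no\nrainy,mild,yes,no\n"
json_text = (
    '[{"outlook": "sunny", "temp": "hot", "windy": "no", "play": "no"},'
    ' {"outlook": "rainy", "temp": "mild", "windy": "yes", "play": "no"}]'
)

# CSV: flat, named columns; each row becomes a dict keyed by the header.
rows_csv = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: hierarchical in general; here a list of key/value objects.
rows_json = json.loads(json_text)

print(rows_csv[0]["outlook"])   # sunny
print(rows_json[0]["outlook"])  # sunny
```

For flat tables the two end up equivalent; JSON only pulls ahead once the data is nested, which CSV cannot express directly.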
What are Databases?
= Collections of hard disks. Internet connected machines that host a bunch of magnetic disks, that store
a lot of information. Because they are internet connected you can access them remotely. You would
connect to them via an IP address or URL.
https://mydatabase.com, which will ask for a username + password. You will then be able to run queries to get
data from the database to you. DBMS: a Database Management System is the software that sits between the PC and
the database.
Databases are typically split into two types, which differ in how they handle and structure these queries:
Relational databases – Structured Query Language (SQL) databases
• Pre-defined, structured tables, relational → not easy to scale horizontally, but robust and well-supported
• Ex.: MySQL, PostgreSQL, SQLite, MariaDB
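A minimal sketch of a relational (SQL) database using SQLite, one of the examples above, which ships with Python. The table name, columns, and rows are invented for illustration:

```python
import sqlite3

# An in-memory SQLite database: no server needed, handy for experiments.
con = sqlite3.connect(":memory:")

# Pre-defined, structured table: columns are fixed up front.
con.execute("CREATE TABLE weather (outlook TEXT, windy TEXT, play TEXT)")
con.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [("sunny", "no", "no"), ("cloudy", "no", "yes")],
)

# A query in SQL: on which days did the kid play?
rows = con.execute("SELECT outlook FROM weather WHERE play = 'yes'").fetchall()
print(rows)  # [('cloudy',)]
con.close()
```

With a remote database (MySQL, PostgreSQL, ...) the only real difference is that `connect` points at a server instead of local memory; the SQL itself stays largely the same.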
Non-relational databases – NoSQL
• No pre-defined schema (e.g. document- or key/value-based) → flexible and easier to scale horizontally
• Ex.: MongoDB, Redis, Cassandra