Data Science Concepts summary
Introduction to Data Science: Lecture 1
Industry 1.0: mechanization of production, powered by the steam engine
Industry 2.0: mass production and further automation, enabled by the advent of electricity
Industry 3.0: automation of production through computers and electronics
Industry 4.0: mass customization with the use of cyber-physical systems
Malthusian catastrophe: the human population grows exponentially while the Earth's resources
grow at a much slower rate, so the population eventually outstrips the available resources
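A minimal numeric sketch of this argument, assuming (for illustration only) a population that grows by a fixed percentage per step and resources that grow by a fixed amount per step. The function name, starting values and rates are hypothetical and not taken from the lecture.

```python
def first_shortfall(population=100.0, resources=200.0,
                    growth_rate=0.05, resource_increment=5.0,
                    max_steps=1000):
    """Return the first step at which population exceeds resources.

    All default numbers are made up; they only illustrate
    exponential versus linear growth.
    """
    for step in range(1, max_steps + 1):
        population *= 1 + growth_rate      # exponential growth
        resources += resource_increment    # linear growth
        if population > resources:
            return step
    return None  # no shortfall within max_steps


print(first_shortfall())
```

However large the head start of the resources, a positive growth rate eventually overtakes a fixed increment; only the step at which this happens changes.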
Artificial Intelligence:
is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by
humans
the study of "intelligent agents": any device that perceives its environment and takes actions
that maximize its chance of successfully achieving its goals
Moore’s law: the observation that the number of transistors on an integrated circuit doubles roughly
every two years, so we can expect the speed and capability of our computers to increase every couple
of years while we pay less for them
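Moore's law is commonly stated as a doubling of the transistor count roughly every two years. The arithmetic behind that statement can be sketched as follows; the function name, the two-year default and the starting count are assumptions for illustration.

```python
def projected_transistors(initial_count, years, doubling_period_years=2.0):
    """Project a transistor count assuming a fixed doubling period."""
    return initial_count * 2 ** (years / doubling_period_years)


# Ten years with a two-year doubling period means five doublings.
print(projected_transistors(1_000, 10))  # 32000.0
```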
Testing intelligence:
The Turing test: a human evaluator judges natural language conversations between a human
and a machine. If the evaluator cannot reliably tell the machine from the human, the machine
is said to have passed the test
The coffee test
The robot college student test
The employment test
The singularity: the point where machine intelligence exceeds the human intellect
The internet of things (IoT): the network of physical objects or "things" embedded with electronics,
software, sensors, and network connectivity, which enables these objects to collect and exchange
data. Sensors built into the physical objects around us and powered by cloud technology allow the
“things” around us to help us with daily chores, monitor our health, sense the environment, adjust
temperature control systems, and endless other activities
Levels of smart products:
1. Monitoring
2. Control
3. Optimizing
4. Autonomy
Big data: data sets so large or complex that traditional data processing applications are inadequate
Lecture 2
Data properties:
Structured data: most often used in data science (easiest)
o Tabulated, fixed categories/fields/ranges
o stored in databases, spreadsheets etc.
Unstructured data: everything else
o video, audio, text (books, e-mail, web pages)
o can be “reduced” to structured data
e.g. represent a text as a vector of word counts
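A minimal sketch of reducing a text to a vector of word counts, using only the standard library. The function name, the tokenization rule (lowercase runs of letters and apostrophes) and the example vocabulary are assumptions for illustration.

```python
import re
from collections import Counter


def word_count_vector(text, vocabulary):
    """Represent a text as a list of counts, one per vocabulary word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Words missing from the text get a count of 0.
    return [counts[word] for word in vocabulary]


vocabulary = ["data", "science", "big"]
print(word_count_vector("Big data is not data science.", vocabulary))  # [2, 1, 1]
```

The resulting vector has a fixed length (one field per vocabulary word), which is what makes it structured data.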
Types of structured data:
quantitative
o numeric
continuous e.g. 2.7183, 3.1416, 17%
discrete e.g. 3, -1
o ordinal e.g. small/medium/large
qualitative
o categorical/nominal e.g. red/blue/green
o Boolean e.g. false/true, yes/no
Five v’s of big data:
1. Volume
2. Variety
3. Value
4. Velocity
5. Veracity
Big data is not a goal in itself: use the data you need to answer a question, no more than that
the additional data points may not outweigh the extra noise they bring
the bigger the data, the harder to visualize and process
solid relations/predictions are often simple enough to be found based on relatively little data
Big data analysis focuses on how to (computationally) handle large volumes of (unstructured) data
Data science focuses on how to process and analyse data to answer questions
Question types:
Can we find differences?
Can we predict a value or category?
Can we find groups/similarities?
Data analysis types:
Hypothesis testing: decide whether a value or difference is significant
Classification: learning to predict a category (label)
Regression: learning to predict a continuous output
Clustering: finding natural groups in data
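As an illustration of one of these analysis types, regression, here is a sketch of fitting a straight line by ordinary least squares in plain Python. The function name and the data points are hypothetical; the points were chosen to lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0

# Predict a continuous output for an unseen input.
print(slope * 10 + intercept)  # 21.0
```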
Serendipity: using or combining data for new, unexpected purposes
Data acquisition: Lecture 3
Hardware: the physical elements from which the system is built and which actually perform the work
Software: a collection of data or computer instructions that tell the computer how to work
Application software: computer program designed to perform a group of coordinated
functions, tasks, or activities for the benefit of the user
System software: computer software designed to provide services to other software
Enterprise application software: computer software used to satisfy the needs of an
organization rather than individual users
Programming language: set of syntactic and semantic rules used to define computer programs,
consisting of both syntax (how the various symbols of the language may be combined) and
semantics (the meaning of the language constructs)
Interpreter: translates and immediately executes each instruction in turn, one by one (easier to debug)
Compiler: translates all instructions to machine code, which is saved and executed as needed
Interpreter | Compiler
Translates a program one statement at a time | Scans the entire program and translates it as a whole into machine code
Takes less time to analyse the source code, but the overall execution time is comparatively slower | Takes more time to analyse the source code, but the overall execution time is comparatively faster
Generates no intermediate object code, hence is memory efficient | Generates intermediate object code, which further requires linking, hence requires more memory
Continues translating the program until the first error is met, at which point it stops; hence debugging is easy | Generates error messages only after scanning the whole program; hence debugging is comparatively hard
Languages such as Python and Ruby use interpreters | Languages such as C and C++ use compilers
Some languages use both a compiler and an interpreter (Java and Python): the source code is first compiled to bytecode, which is then executed by a virtual machine
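In CPython, the reference implementation of Python, source code is first compiled to bytecode and that bytecode is then executed by the interpreter. A small sketch of the two steps using the built-in compile() and exec() functions and the standard dis module; the source string and the "<example>" file label are made up for illustration.

```python
import dis

source = "result = 2 + 3"

# Step 1: compile the source text into a code object (bytecode).
code_object = compile(source, "<example>", "exec")

# Step 2: let the interpreter execute that bytecode in a fresh namespace.
namespace = {}
exec(code_object, namespace)
print(namespace["result"])  # 5

# Optional: list the bytecode instructions the interpreter runs.
dis.dis(code_object)
```

The exact instructions printed by dis.dis() differ between Python versions, so they are not reproduced here.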
Python:
Programming language
interpreted, high-level, general-purpose programming language
Its language constructs and object-oriented approach aim to help programmers write clear,
logical code for small and large-scale projects
Python is dynamically typed and garbage-collected
It supports multiple programming paradigms, including procedural, object-oriented, and
functional programming
Python is often described as a "batteries included" language due to its comprehensive
standard library
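A short sketch of the properties listed above: the same sum computed in procedural, object-oriented and functional style, followed by dynamic typing. The class and variable names are hypothetical and chosen only for illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Procedural style: an explicit loop that updates a running total.
total = 0
for n in numbers:
    total += n


# Object-oriented style: state and behaviour bundled in a class.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value


acc = Accumulator()
for n in numbers:
    acc.add(n)

# Functional style: combine the values with a function, without mutation.
functional_total = reduce(lambda a, b: a + b, numbers, 0)

print(total, acc.total, functional_total)  # 10 10 10

# Dynamic typing: the same name can refer to values of different types.
value = 42
value = "forty-two"
print(type(value).__name__)  # str
```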
Python variables:
Variables are containers for storing data values
Unlike many other programming languages, Python has no command for declaring a variable.
A variable is created the moment you first assign a value to it
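A minimal illustration of variables being created by assignment; the names x, name and undefined_name are arbitrary examples.

```python
# No declaration is needed: the first assignment creates the variable.
x = 5
name = "Ada"
print(x, name)  # 5 Ada

# Reading a name that was never assigned raises a NameError.
try:
    print(undefined_name)
except NameError as error:
    print("NameError:", error)
```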
Python data types:
In programming, data type is an important concept
Variables can store data of different types, and different types can do different things
Python has several data types built-in by default
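A quick way to inspect some of the built-in types is type(). The sample values below are arbitrary; the last two lines show that the same operator behaves differently depending on the type.

```python
samples = [3, 2.7183, "red", True, [1, 2], (1, 2), {"a": 1}, {1, 2}, None]

for value in samples:
    # type(value).__name__ gives the name of the value's type, e.g. "int".
    print(repr(value), "->", type(value).__name__)

# Different types can do different things with the same operator.
print(3 + 4)      # 7  (numeric addition)
print("3" + "4")  # 34 (string concatenation)
```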
Python functions:
A function is a block of code which only runs when it is called
You can pass data, known as parameters, into a function
A function can return data as a result
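A minimal sketch of these three points: defining a function, passing data in as a parameter, and returning a result. The function name mean and its parameter values are hypothetical.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# The body above only runs when the function is called.
average = mean([2, 4, 6])
print(average)  # 4.0
```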
There are three types of functions in Python:
o Built-in functions, such as help() to ask for help, min() to get the minimum value,
print()