Machine Learning for the Quantified Self - Book Summary
Chapter 1: Introduction
1.1 The Quantified Self
The quantified self is any individual engaged in the self-tracking of any kind of biological,
physical, behavioral, or environmental information. The self-tracking is driven by a certain goal of
the individual and a desire to act upon the collected information.
What drives quantified selves to gather information? Three broad categories:
● Improve health (e.g. cure or manage a condition, achieve a goal, execute a treatment
plan)
● Enhance other aspects of life (maximize work performance, be mindful)
● Find new life experiences (e.g. learn to increasingly enjoy activities, learn new things).
Five-Factor-Framework of Self-Tracking Motivations
● Self-healing (help yourself to become healthy)
● Self-discipline (like the rewarding aspects of the quantified self)
● Self-design (control and optimize yourself using the data)
● Self-association (enjoying being part of a community and relating yourself to it)
● Self-entertainment (enjoying the entertainment value of the self-tracking)
Since self-tracking data can be misused or used in a way that is not fully in the interest of a
person, it is not surprising that users state the loss of privacy as their main concern in this
context.
1.2 The Goal of this Book
Machine learning aims to automatically identify patterns in data. Specifically, the goal here is to
automatically extract patterns from the collected data and to enable a user to act upon the
resulting insights effectively, which in turn contributes to the goal of the user.
Unique characteristics of machine learning in the quantified self context
● Sensory data is noisy
● Many measurements are missing
● The data has a highly temporal nature
● Algorithms should enable the support of and interaction with users without a long
learning period
● We collect multiple datasets (one per user) and can learn across them
1.3 Basic Terminology
A measurement is one value for an attribute recorded at a specific time point. Measurements can
be numerical, or categorical with an ordering (ordinal) or without (nominal). Measurements
frequently come in sequences, which we call time series. A time series is a series of
measurements in temporal order.
Machine learning is commonly divided into four types of learning problems:
● Supervised learning: the machine learning task of inferring a function from labeled
training data
● Unsupervised learning: there is no target measure (or label), and the goal is to
describe the associations and patterns among the attributes
● Semi-supervised learning: a technique to learn patterns in the form of a function based
on labeled and unlabeled training examples
● Reinforcement learning: tries to find optimal actions in a given situation so as to
maximize a numerical reward that does not immediately come with the action but later in
time. The learner is not told which actions to take as in supervised learning but instead
must discover which actions yield the highest reward over time by trying them.
1.4 Basic Mathematical Notation
1.5 Overview of the Book
Chapter 2: Basics of Sensory Data
2.1 Crowdsignals Dataset
There exists a huge variety of sensors. Popular (smartphone) sensors:
● Accelerometer: measures the changes in forces acting upon the phone along the x-, y-, and z-axes
● Gyroscope: measures the orientation of the phone compared to the “down” direction (the
earth’s surface) and the angular velocity
● Magnetometer: measures the x-, y-, and z-orientation relative to the earth’s magnetic
field
● GPS signal: measures your position by means of your distance to a number of satellites
whose positions are known
2.2 Converting the Raw Data to an Aggregated Data Format
In order to convert the temporal data, we first need to determine the time step size we are going
to use in our dataset. This is also referred to as the level of granularity (selecting a ∆t). The
selection of the step size depends on a variety of factors, including the task, the noise level, the
available memory and cost of storage, the available computational resources for the machine
learning process, etc. Once we have selected this step size we can create an empty dataset.
We start with the earliest time point observed in our crowdsignals measurements and generate
a first row x_{t_start}. Iteratively, we create additional rows for the following time steps by taking the
previous time step and adding our step size, e.g. x_{t_start + ∆t}.
Each row x_t represents a summary of the values encountered in the interval from the time
step it was created for until the next time step. We continue until we have reached the last time
step in our dataset. Next, we should identify the columns in our dataset (our attributes) that we
want to aggregate. For the numerical values (e.g., heart rate), we create a single column for
each variable we measure, while for the categorical values we create a separate column for
each possible value.
Once we have defined the entire empty dataset, we are ready to derive the values for each
attribute at each discrete time step (i.e. each row). We can aggregate numerical values by
averaging the relevant measurements or we can sum them up (e.g. when the measurements
concern a quantity) or use other descriptive metrics from statistics such as median or variance.
For categorical values we can record whether at least one measurement of that value has been
found in the interval (binary) or we can count the number of measurements that have been
found for the value (sum).
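The aggregation procedure above can be sketched in plain Python. The attribute names, timestamps, and the choice of a 5-second ∆t are made-up illustrations; numerical values are averaged per interval, and the categorical value gets a binary column:

```python
from datetime import datetime, timedelta

# Raw measurements as (timestamp, attribute, value) triples (illustrative data).
raw = [
    (datetime(2024, 1, 1, 12, 0, 1), "hr", 71.0),
    (datetime(2024, 1, 1, 12, 0, 4), "hr", 75.0),
    (datetime(2024, 1, 1, 12, 0, 2), "label", "walking"),
    (datetime(2024, 1, 1, 12, 0, 7), "hr", 80.0),
]

def aggregate(raw, delta_t):
    t_start = min(t for t, _, _ in raw)
    t_end = max(t for t, _, _ in raw)
    rows = []
    t = t_start
    while t <= t_end:
        # All measurements falling in the interval [t, t + delta_t).
        window = [(a, v) for (ts, a, v) in raw if t <= ts < t + delta_t]
        hr_vals = [v for a, v in window if a == "hr"]
        rows.append({
            "t": t,
            # Numerical attribute: average of the measurements in the interval.
            "hr": sum(hr_vals) / len(hr_vals) if hr_vals else None,
            # Categorical attribute: 1 if the value occurs at least once (binary).
            "label_walking": int(any(a == "label" and v == "walking"
                                     for a, v in window)),
        })
        t += delta_t
    return rows

dataset = aggregate(raw, timedelta(seconds=5))
```

With ∆t = 5 s this yields two rows: the first averages the two heart-rate readings and records the walking label; the second holds only the last reading.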
2.4 Machine Learning Tasks
Focusing on supervised learning we define two tasks:
● A classification problem, namely predicting the label (i.e. activity) based on the sensors
● A regression problem, namely predicting the heart rate based on the other sensory
values and the activity
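Given the aggregated rows, the two tasks differ only in which column serves as the target. A minimal sketch with hypothetical column names (any learning algorithm could then be fitted on these feature/target pairs):

```python
# One aggregated row per time step (hypothetical column names).
rows = [
    {"acc_x": 0.1, "hr": 72.0, "label_walking": 1},
    {"acc_x": 0.9, "hr": 120.0, "label_walking": 0},
]

# Classification: predict the activity label from the sensor columns.
X_cls = [{k: v for k, v in r.items() if k != "label_walking"} for r in rows]
y_cls = [r["label_walking"] for r in rows]

# Regression: predict heart rate from the other sensors and the activity.
X_reg = [{k: v for k, v in r.items() if k != "hr"} for r in rows]
y_reg = [r["hr"] for r in rows]
```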
Chapter 3: Handling Noise and Missing Values in Sensory Data
Three approaches for handling noise:
1. Detect and remove outliers from our data
2. Impute missing values in our data (could also have been outliers that were removed)
3. Transform our data to identify most important parts
3.1 Detecting Outliers
An outlier is an observation point that is distant from other observations. There are two types:
● Those caused by a measurement error, which may be removed based on
○ domain knowledge
○ visual inspection
○ trying whether we improve on our machine learning tasks when we remove them
● Those simply caused by variability of the phenomenon that we observe or measure
Distribution-Based Models → outlier removal is based on the probability distribution of the data
● Chauvenet's criterion: identify values for an attribute that are unlikely given a single normal
distribution N(μ, σ²) assumed to describe the data
○ Given that we have N measurements for attribute Xj, we compute the mean μ and
standard deviation σ of our data, and mark a measurement as an outlier when the
probability of observing a value at least as extreme under N(μ, σ²) is lower than 1/(2N).
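The criterion can be sketched with the standard library: fit μ and σ to the data, then flag a value when the expected number of measurements at least that extreme (N times the two-sided normal tail probability) falls below 0.5. The function name and sample data are illustrative:

```python
import math

def chauvenet_outliers(xs):
    """Flag values unlikely under a single normal N(mu, sigma^2)
    fitted to the data (Chauvenet's criterion)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    outliers = []
    for x in xs:
        # Two-sided probability of a deviation at least this large
        # under the fitted normal distribution.
        p = math.erfc(abs(x - mu) / (sigma * math.sqrt(2)))
        # Reject if the expected count of values this extreme is < 1/2,
        # i.e. p < 1/(2N).
        if n * p < 0.5:
            outliers.append(x)
    return outliers

# Nine readings near 10 plus one spike: only the spike is flagged.
flagged = chauvenet_outliers(
    [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1, 10.0, 30.0])
```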