Module 1 – Uncertainty
Contents
Module 1 – Uncertainty
    Test and roll
    Option value
    Classical uncertainty
    Bayesian approach
    Comparing posteriors
    Size of test group
Module 2 – RFM
    RFM
    Empirical Bayes and clumpiness
Lecture 3 – Logistic Regression
    Churn
    Logistic regression
    Overfitting
    Lifts and optimal targeting
Lecture 4 – Subset selection, LASSO, decision trees and random forests
    Subset (of predictors) selection
    LASSO
    Decision trees
    Random forests
Lecture 5 – Collaborative filtering, Cross-selling, Upselling
    NPTB models
    Introduction to recommender systems
    Recommender system models
Lecture 6 – CLV in a contractual setting
    CLV – Definitions
    CLV – Geometric Model
    RLV
    Heterogeneity and retention rates
    sBG model
Lecture 7 – CLV in a non-contractual setting
    Intro
    BGBB
    Interpreting results BGBB
    CLV RLV
    Extensions
Module 1 – Uncertainty
N: The population (all the customers)
n: The (test) sample size
m: The margin (profit) per response
$\hat{p}$: The estimate of the response rate
c: Cost (of marketing)
p: The true population response rate
B: Number of bootstrap samples
a (α): Used in the Bayesian (beta) prior; indicates the heterogeneity of the customers. The closer this value is to 0, the more extreme the difference between segments that do respond and those that do not. Can be seen as a prior number of successes
b (β): The other parameter in the prior. Can be seen as a prior number of failures
∝: "Proportional to". In the context of distributions it means the function does not have to integrate to 1
$\sigma$: The standard deviation of the population
s: The standard deviation of the sample; in the beta-binomial formulas below, s instead denotes the number of successes in the test
$\hat{p} = \frac{1}{n}\sum_i x_i$: Sample mean estimate. If you don't have a sample, $\hat{p}$ can be based on past data
$se = \sqrt{\sigma^2/n}$: Standard error
$se(p) = \sqrt{p(1-p)/n}$: The standard error of p
$n - s$: The number of failures in the test sample (s successes out of n observations)
$s \approx \sqrt{\mu(1-\mu)}$: An approximation of s, which is allowed when $\mu$ is between 0 and 1
$y_A \sim N(m_A, s^2)$: The likelihood
$m_A \sim N(\mu, \sigma^2)$: The posterior distribution of group A
$p = \frac{c}{m}$: The threshold. If the response rate is higher, the campaign is profitable
$E[p] = \frac{a}{a+b}$: The expected response rate, based on the parameters of the prior
$\sigma = \sqrt{\left(\frac{1}{\sigma_0^2} + \frac{n}{s^2}\right)^{-1}}$: The standard deviation of the posterior in a normal-normal model
$\mu = \sigma^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{y}}{s^2}\right)$: The mean of the posterior in a normal-normal model
$n_A^* = \sqrt{\frac{N}{4}\left(\frac{s}{\sigma}\right)^2 + \left(\frac{3}{4}\left(\frac{s}{\sigma}\right)^2\right)^2} - \frac{3}{4}\left(\frac{s}{\sigma}\right)^2$: Optimal sample size of group A under the normal-normal model
$f(p) \propto p^{a-1}(1-p)^{b-1}$: The distribution of the prior in the Bayesian approach
$f(p) \propto p^{a+s-1}(1-p)^{b+n-s-1}$: The distribution of the posterior in the Bayesian approach
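The optimal sample-size formula above can be checked numerically; a minimal Python sketch (the function name and example values are my own, not from the notes):

```python
import math

def optimal_test_size(N, s, sigma):
    # Profit-maximising test size for one group under the normal-normal model:
    # n_A* = sqrt(N/4 * (s/sigma)^2 + (3/4 (s/sigma)^2)^2) - 3/4 (s/sigma)^2
    r = (s / sigma) ** 2
    return math.sqrt(N / 4 * r + (3 / 4 * r) ** 2) - 3 / 4 * r

# Example: population of 100,000, response sd s = 0.1, prior sd sigma = 0.2
n_star = optimal_test_size(100000, 0.1, 0.2)  # roughly 79 customers per group
```

Note that the noisier the responses (s) are relative to the prior spread across treatments (σ), the larger the test needs to be.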
Customer analytics: Since the 1990s there has been a shift from a focus on the product to a focus on the customer. You use customer data and statistical models to make business decisions, such as whom to target, whom to test, the number of subscriptions, and CLV (customer lifetime value)
Customer lifecycle: Marketing is all about acquiring, developing and retaining customers. The customer lifecycle has 3 stages:
1) Customer acquisition: How customers are "born", i.e. their first contact with the firm
2) Customer development: Change in behaviour over time: buying more (up-selling) or different things (cross-selling)
3) Customer retention: Preventing customer "death", or churn
Test and roll
Test & roll experiments:
• Test sample (size = n): A subset of customers. After you send the test and collect and analyse the responses, you use the results to decide whether to send to the rest of the population. After the test results are in, you have the option, but not the obligation, to roll out. Hence, this is an option
• Rollout sample (size = N − n): The rest of the population. You only roll out if E[rollout profit] > 0
Expected rollout profits: $E[\text{rollout profit}] = (N-n)(m\hat{p} - c)$
Estimate: An approximation of a characteristic of the population, based on a sample
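The rollout decision above can be written out in a couple of lines; a sketch with illustrative numbers (not from the slides):

```python
def expected_rollout_profit(N, n, m, p_hat, c):
    # E[rollout profit] = (N - n) * (m * p_hat - c): the remaining
    # N - n customers times the expected margin per contacted customer
    return (N - n) * (m * p_hat - c)

# Roll out only when the expected rollout profit is positive
profit = expected_rollout_profit(N=50000, n=5000, m=50, p_hat=0.05, c=1.5)
should_roll_out = profit > 0
```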
Option value
Option value: It is assumed that the test provides perfect information. The value is E[profit | test] − E[profit | no test]. If the test predicts a failure (i.e. E[rollout profit] < 0), you will not roll out and you only have the costs of the test. Hence:
• E[profit | test]: The expected profit after you know the test results. E.g. suppose there is a 30% chance of success, a profit margin m of 50, a response rate p of 0.05 if it's a success and 0.01 if it's a failure, and a cost c of 1.50. There are 50,000 customers in total and the test sample is 10% (5,000 customers). If it's a success, the profit is (m*p − c) = (50*0.05 − 1.5) = 1 per customer; if it's a failure, (50*0.01 − 1.5) = −1 per customer. You will not roll out if the test predicts a failure (which happens in 70% of the cases), so in that case you only lose on the 5,000 test customers. The E[profit | test] is 0.3*(50,000*1) + 0.7*(−5,000) = 11,500. This is also the maximum amount of money you are willing to pay for the test (i.e. you are willing to pay < 11,500 to know what the outcome of the project will be)
• E[profit | no test]: First you calculate the expected value of the project per customer: (50*0.05 − 1.5)*0.3 + (50*0.01 − 1.5)*0.7 = −0.4. This is negative, so you will not do the project if you don't test. Hence, the expected profit is 0. If it's positive, however, you simply calculate the expected profit per customer * N
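The option-value example can be reproduced step by step; a minimal sketch assuming, as in the text, a perfectly informative test (function and parameter names are my own):

```python
def option_value(N, test_frac, m, c, p_success, p_failure, prob_success):
    # Option value = E[profit | test] - E[profit | no test],
    # assuming the test perfectly reveals success or failure
    n = int(N * test_frac)
    margin_s = m * p_success - c   # per-customer profit if success
    margin_f = m * p_failure - c   # per-customer profit if failure
    # With a test: earn on everyone on success, stop after the test on failure
    e_profit_test = prob_success * N * margin_s + (1 - prob_success) * n * margin_f
    # Without a test: run the campaign for all N only if its expectation is positive
    e_per_customer = prob_success * margin_s + (1 - prob_success) * margin_f
    e_profit_no_test = max(0.0, N * e_per_customer)
    return e_profit_test - e_profit_no_test

# The numbers from the example: the option value comes out at 11,500
value = option_value(N=50000, test_frac=0.1, m=50, c=1.5,
                     p_success=0.05, p_failure=0.01, prob_success=0.3)
```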
Classical uncertainty
Central limit theorem: For large enough samples, the distribution of the sample mean is approximately normal: $\hat{p} \sim N(p, se(p)^2)$. The SE (standard error) is used since it concerns the estimate of p. The SE is the SD divided by √n, so the larger n is, the smaller the SE and the more accurate the estimate
Bootstrap: Create new samples with replacement from the original sample, using the same sample size (e.g. if your sample frame is {0, 2, 4, 6}, two bootstrap samples could be {0, 0, 0, 4} and {4, 2, 0, 6}). For each bootstrap sample you compute the mean $\hat{p}_b$. With those you can build a distribution and estimate, for instance, the probability that the response rate lies below some value x: $\frac{1}{B}\sum_b 1\{\hat{p}_b < x\}$, i.e. the sum over all bootstrap estimates of the response rate, counting 1 if the estimate is below x and 0 otherwise, divided by the number of bootstrap samples B. For example, if x is 0.3 and 300 of your 1,000 bootstrap samples have a $\hat{p}_b$ below 0.3, the estimated probability is (300*1 + 700*0)/1000 = 0.3. With bootstrap aggregation (bagging) you can reduce the variance of a statistical learning method (e.g. decision trees)
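The bootstrap procedure can be sketched in plain Python (the data and the threshold x are illustrative, not from the notes):

```python
import random

def bootstrap_prob_below(responses, x, B=1000, seed=1):
    # Resample with replacement B times; return the fraction of bootstrap
    # response-rate estimates p_hat_b that fall below x
    rng = random.Random(seed)
    n = len(responses)
    count = 0
    for _ in range(B):
        resample = [rng.choice(responses) for _ in range(n)]
        if sum(resample) / n < x:
            count += 1
    return count / B

# 0/1 responses from a hypothetical test mailing: observed p_hat = 0.30
data = [1] * 30 + [0] * 70
prob = bootstrap_prob_below(data, x=0.25)  # estimated P(response rate < 0.25)
```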
Bayesian approach
Bayesian approach: This has two steps (for categorical data), whereby you specify a prior before computing a posterior distribution:
1) Prior distribution: The distribution you specify before running the actual test. This is a distribution of the response rate based on previous experience or, when you have no idea, one where each value is equally likely (also called a flat or diffuse prior, which is a uniform distribution). A diffuse prior carries little weight since it is spread across many values. This distribution is the beta: $p \sim \text{beta}(a, b)$, with density $f(p) \propto p^{a-1}(1-p)^{b-1}$, where a and b relate to the two groups (e.g. people who responded and who did not respond). A flat distribution has a = b = 1. If a is smaller than 1, the density goes up at the left-hand side; if b is smaller than 1, it goes up at the right-hand side. If a = b, it is symmetric. The larger a and b become, the more the distribution is centred in the middle. If you have some clue about the probabilities, you change the parameters until you have a beta distribution that fits your beliefs
2) Posterior distribution: The distribution you get after running the actual test. It is an updated version of the prior, also called the beta-binomial model. It is the beta distribution including the actual observations from the test: $p \sim \text{beta}(a+s, b+n-s)$. The more test observations there are, the less weight the prior gets. To calculate the posterior, you multiply the likelihood (the actual test) with the prior: posterior ∝ likelihood ∙ prior, which results in $f(p) \propto p^{a+s-1}(1-p)^{b+n-s-1}$ (ignore the derivation; only the resulting beta distribution is relevant). To compare two groups, count the number of times a draw from the beta(a+s, b+n−s) of one group is bigger than a draw from the other group. Remember, with a flat prior, replace a and b with 1
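Comparing two groups by counting how often one posterior draw beats the other can be sketched as follows (flat prior and illustrative response counts; the function name is my own):

```python
import random

def prob_A_beats_B(a, b, n_A, s_A, n_B, s_B, draws=10000, seed=1):
    # Posterior of each group is beta(a + successes, b + failures);
    # return the share of paired draws in which group A's rate is higher
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_A = rng.betavariate(a + s_A, b + n_A - s_A)
        p_B = rng.betavariate(a + s_B, b + n_B - s_B)
        if p_A > p_B:
            wins += 1
    return wins / draws

# Flat prior (a = b = 1), 30/1000 responses in group A vs 20/1000 in group B
prob = prob_A_beats_B(1, 1, n_A=1000, s_A=30, n_B=1000, s_B=20)
```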
Comparing posteriors
Hold-out group: These customers receive no treatment. You compare them with the active group (respondents who did receive the treatment)
Normal-normal model: Instead of looking at whether someone responded or not, you could also look at continuous data such as minutes on site or profits. This can be captured with this model, whereby the likelihood of each respondent ($y_i \sim N(m, s^2)$) is normal, as well as the prior ($m \sim N(\mu_0, \sigma_0^2)$). Hence, the posterior is normally distributed as well: $m \sim N(\mu, \sigma^2)$ (the normal prior is conjugate to the normal likelihood). You can use the pnorm function in R to calculate the probability of an observation equal to or lower than a particular value
A/B testing: If you subtract the posterior of group A from that of group B, then B is bigger than A when the difference m is positive
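The normal-normal updating formulas from the table, plus a stdlib stand-in for R's pnorm, can be sketched as follows (example numbers are my own):

```python
import math

def normal_normal_posterior(y_bar, s, n, mu0, sigma0):
    # Posterior mean and sd when a normal likelihood (sd s, n observations
    # with mean y_bar) meets a normal prior N(mu0, sigma0^2)
    var = 1.0 / (1.0 / sigma0**2 + n / s**2)
    mu = var * (mu0 / sigma0**2 + n * y_bar / s**2)
    return mu, math.sqrt(var)

def pnorm(x, mean=0.0, sd=1.0):
    # P(X <= x) for X ~ N(mean, sd^2), like pnorm in R
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Diffuse prior, 100 observations of e.g. minutes on site averaging 5.2
mu, sd = normal_normal_posterior(y_bar=5.2, s=2.0, n=100, mu0=0.0, sigma0=10.0)
prob_above_5 = 1.0 - pnorm(5.0, mean=mu, sd=sd)
```

With a diffuse prior (large σ0), the posterior mean stays close to the sample mean ȳ.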
Size of test group
Optimal test size: This is the test size where E[profit_test + profit_rollout] is maximal. Large tests have a low rollout error (low risk), but many people will see the inferior option (opportunity cost). Hence, it is a trade-off between learning in the test phase and earning during the rollout phase, especially when N is limited. The hypothesis-testing sample size differs from the profit-maximising test size, since with hypothesis testing you really test which option is better (e.g. α = 0.05 and power 1 − β = 0.8). For example, if the