# Generalization in Adaptive Data Analysis and Holdout Reuse

1. Overview

When training a machine learning model, the standard practice is to use a training set to fit the parameters of the model and a holdout set to tune the hyperparameters of the learning algorithm ${A}$, which outputs a function ${f: \chi \rightarrow \Upsilon}$, in the hope that ${f}$ generalizes well. The problem is that in practice the user adjusts the next algorithm ${A_m}$ based on the analysis of the results of the previous algorithm ${A_{m-1}}$. Such a series of dependent steps (${A_1, A_2, . . . , A_{m-1}, A_{m}}$) is said to be composed adaptively. Unfortunately, when the same holdout set is reused over and over to guide these choices, the resulting function ${f}$ tends to overfit the holdout set. To combat this issue, we first introduce a generalization measure called approximate max information [6]: if the adaptively composed analysis has bounded ${\beta}$-approximate max information, then any bad event (such as overfitting the holdout) is at most a bounded factor more likely than it would be with a fresh holdout set, up to an additive ${\beta}$. We then show that bounded description length and differential privacy are two ways to guarantee small max information, and we describe an algorithm called Thresholdout [6], which allows the holdout set to be reused while preserving a generalization guarantee.
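To see why naive holdout reuse is dangerous, consider the following minimal sketch (an illustration in the spirit of the experiments in [6], with made-up sizes and thresholds): the data contain no real signal, yet selecting features by how they look on both the training and holdout sets produces a classifier whose holdout accuracy is well above chance.

```python
import numpy as np

# Illustration only: labels are independent of the features, so every
# classifier's true accuracy is exactly 50%.
rng = np.random.default_rng(0)
n_train, n_holdout, n_features = 1000, 1000, 5000

X_train = rng.standard_normal((n_train, n_features))
y_train = rng.choice([-1, 1], size=n_train)
X_hold = rng.standard_normal((n_holdout, n_features))
y_hold = rng.choice([-1, 1], size=n_holdout)

# Adaptive analyst: keep features whose empirical correlation with the labels
# looks strong on the training set AND has the same sign on the holdout set,
# then build a simple sign classifier from them.
corr_train = X_train.T @ y_train / n_train
corr_hold = X_hold.T @ y_hold / n_holdout
selected = np.where((np.abs(corr_train) > 0.04) & (corr_train * corr_hold > 0))[0]

weights = np.sign(corr_train[selected])
pred_hold = np.sign(X_hold[:, selected] @ weights)
print("holdout accuracy:", np.mean(pred_hold == y_hold))  # well above 0.5
print("true accuracy:    0.5 by construction")
```

Because the holdout set participated in the feature selection, its accuracy estimate is no longer an unbiased estimate of the true accuracy; this is exactly the failure mode the rest of these notes quantify.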

2. Approximate Max Information

Consider the random variables ${P}$ and ${Q}$. We define the max divergence to be

$\displaystyle D_{\infty}(P\|Q)=log \max_{x \in \chi} {{\mathop{\mathbb P}[P=x]}\over{\mathop{\mathbb P}[Q=x]}} \ \ \ \ \ (1)$

and the ${\beta}$-approximate max divergence to be

$\displaystyle D_{\infty}^{\beta} (P\|Q)=log \max_{O \subseteq \chi, \mathop{\mathbb P}[P \in O] > \beta} {{\mathop{\mathbb P}[P \in O]-\beta}\over{\mathop{\mathbb P}[Q \in O]}} \ \ \ \ \ (2)$

Definition 1 The max information between two random variables X and Y is

$\displaystyle I_{\infty}(X; Y) = D_{\infty}((X,Y) \| X \times Y) = log \max_{x,y} {{\mathop{\mathbb P}[X=x, Y=y]}\over{\mathop{\mathbb P}[X=x]\mathop{\mathbb P}[Y=y]}}\ \ \ \ \ (3)$

and the ${\beta}$-approximate max information is

$\displaystyle I_{\infty}^{\beta}(X; Y) = D_{\infty}^{\beta}((X,Y) \| X \times Y)\ \ \ \ \ (4)$

where ${X \times Y}$ denotes the product distribution, i.e., sampling ${X}$ and ${Y}$ independently.

In our case, ${X}$ is the dataset ${S \in \chi^n}$ and ${Y}$ is the output ${A(S)}$. We say that ${A}$ has ${\beta}$-approximate max information of ${k}$ if ${I_{\infty}^{\beta}(S; A(S)) \le k}$ for every distribution on ${S}$. Intuitively, ${k}$ measures, in bits, how much information about ${S}$ the output ${A(S)}$ can reveal; an output that can be described with ${k}$ bits takes at most ${2^k}$ values.
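Since the definitions only involve ratios of probabilities, they are easy to evaluate for small finite examples. The sketch below (with a made-up joint distribution) computes ${I_{\infty}(X;Y)}$ and ${I_{\infty}^{\beta}(X;Y)}$ in bits from a joint probability table, using the observation that ${D_{\infty}^{\beta}(P\|Q) \le log \lambda}$ exactly when ${\sum_{x} \max(0, \mathop{\mathbb P}[P=x]-\lambda\mathop{\mathbb P}[Q=x]) \le \beta}$ (a direct consequence of the definition, not a formula from [6]).

```python
import numpy as np

def approx_max_information(joint, beta=0.0, tol=1e-9):
    """beta-approximate max information I_inf^beta(X; Y), in bits, for a
    finite joint distribution given as a 2-D probability table."""
    joint = np.asarray(joint, dtype=float)
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))  # P[X=x] P[Y=y]
    # Binary search for the smallest lam with sum_x max(0, p - lam * q) <= beta;
    # the worst event O in the definition is the set of cells where p > lam * q.
    lo, hi = 0.0, float((joint / np.maximum(product, 1e-300)).max())
    while hi - lo > tol:
        lam = (lo + hi) / 2
        if np.maximum(joint - lam * product, 0.0).sum() <= beta:
            hi = lam
        else:
            lo = lam
    return float(np.log2(hi))

# Toy example: Y is a noisy copy of a fair bit X (illustrative numbers).
joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])
print(approx_max_information(joint, beta=0.0))   # exact: log2(1.8) ~ 0.85 bits
print(approx_max_information(joint, beta=0.05))  # smaller: log2(1.7) ~ 0.77 bits
```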

2.1. Generalization and Composition

Given the definition of approximate max information, we can bound the probability of a bad event under adaptive analysis in terms of its probability when the dataset and the output are independent.

Theorem 2 Let ${S}$ be a random variable over ${\chi^n}$ and ${A: \chi^n \rightarrow \Upsilon}$ with ${I_{\infty}^{\beta}(S; A(S)) \le k}$. Then for any ${O \subseteq \chi^n \times \Upsilon}$,

$\displaystyle \mathop{\mathbb P}[(S, A(S)) \in O] \le 2^k \cdot \mathop{\mathbb P}[S \times A(S) \in O] + \beta \ \ \ \ \ (5)$
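As a quick numerical illustration of how Theorem 2 is used (hypothetical numbers): for a single fixed query ${\phi: \chi \rightarrow [0,1]}$, Hoeffding's inequality bounds the chance that its empirical mean on ${n}$ i.i.d. samples is off by more than ${\tau}$; if the analyst's choice of ${\phi}$ has ${\beta}$-approximate max information at most ${k}$ bits, then taking ${O}$ to be the event that the chosen query is off by more than ${\tau}$ multiplies that bound by ${2^k}$ and adds ${\beta}$.

```python
import math

# Hypothetical numbers, for illustration only.
n, tau = 10_000, 0.05   # sample size and tolerated estimation error
k, beta = 20, 0.01      # max information budget (bits) and slack

# Independent case: Hoeffding bound for one fixed query phi with values in [0, 1].
p_fixed = 2 * math.exp(-2 * n * tau**2)

# Adaptive case via Theorem 2, with O = {(S, phi): |empirical mean - true mean| > tau}.
p_adaptive = 2**k * p_fixed + beta

print(f"fixed query:    {p_fixed:.2e}")     # ~3.9e-22
print(f"adaptive query: {p_adaptive:.2e}")  # ~1.0e-02, dominated by beta
```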

We can use this to bound the max information of algorithms that are composed adaptively.

Lemma 3 Let ${A_1: \chi^n \rightarrow \Upsilon_1}$ be such that ${I_{\infty}^{\beta_1}(S; A_1(S)) \le k_1}$, and let ${A_2: \chi^n \times \Upsilon_1 \rightarrow \Upsilon_2}$ be such that ${I_{\infty}^{\beta_2}(S; A_2(S, y)) \le k_2}$ for every fixed ${y \in \Upsilon_1}$. Then the composition ${A_3: \chi^n \rightarrow \Upsilon_2}$ defined by ${A_3(S)=A_2(S, A_1(S))}$ satisfies ${I_{\infty}^{\beta_1 + \beta_2}(S; A_3(S)) \le k_1 + k_2}$.

Proof: Let ${D}$ be a distribution over ${\chi^n}$ and ${S \sim D}$. ${\forall O \subseteq \chi^n \times \Upsilon_1}$,

$\displaystyle \mathop{\mathbb P}[(S, A_1(S)) \in O] \le 2^{k_1} \cdot \mathop{\mathbb P}[S \times A_1(S) \in O] + \beta_1 \ \ \ \ \ (6)$

${\forall Q \subseteq \chi^n \times \Upsilon_2}$ and ${\forall y \in \Upsilon_1}$:

$\displaystyle \mathop{\mathbb P}[(S, A_2(S, y)) \in Q] \le 2^{k_2} \cdot \mathop{\mathbb P}[S \times A_2(S, y) \in Q] + \beta_2 \ \ \ \ \ (7)$

Define, for each ${s \in \chi^n}$ and ${y \in \Upsilon_1}$,

$\displaystyle \mu(s, y)=\max\{0,\ \mathop{\mathbb P}[S=s, A_1(S)=y] - 2^{k_1}\cdot \mathop{\mathbb P}[S=s]\cdot\mathop{\mathbb P}[A_1(S)=y]\} \ \ \ \ \ (8)$

Applying (6) to the set of pairs on which this difference is positive shows that ${\sum_{s,y}\mu(s, y) \le \beta_1}$.

We then arrive at

$\displaystyle \mathop{\mathbb P}[(S, A_3(S)) \in Q] = \mathop{\mathbb P}[(S, A_2(S, A_1(S))) \in Q]\ \ \ \ \ (9)$

$\displaystyle = \sum\limits_{s\in \chi^n,y\in\Upsilon_1}\mathop{\mathbb P}[(s, A_2(s, y)) \in Q]\cdot\mathop{\mathbb P}[S=s, A_1(S)=y]\ \ \ \ \ (10)$

$\displaystyle \le \sum\limits_{s\in \chi^n,y\in\Upsilon_1}min((2^{k_2}\mathop{\mathbb P}[s \times A_2(s, y) \in Q] + \beta_2), 1) \\ \cdot(2^{k_1}\cdot\mathop{\mathbb P}[S=s]\cdot\mathop{\mathbb P}[A_1(S)=y]+\mu(s, y))\ \ \ \ \ (11)$

Note that ${\sum\limits_{s\in \chi^n,y\in\Upsilon_1}\mu(s, y) \le \beta_1}$, so the ${\mu}$ and ${\beta_2}$ terms together contribute at most ${\beta_1 + \beta_2}$:

$\displaystyle \le \sum\limits_{s\in \chi^n,y\in\Upsilon_1}min(2^{k_2}\mathop{\mathbb P}[s \times A_2(s, y) \in Q], 1) \\ \cdot(2^{k_1}\cdot\mathop{\mathbb P}[S=s]\cdot\mathop{\mathbb P}[A_1(S)=y]) + \beta_1 + \beta_2\ \ \ \ \ (12)$

$\displaystyle \le 2^{k_1+k_2}\cdot(\sum\limits_{s\in \chi^n,y\in\Upsilon_1} \mathop{\mathbb P}[s \times A_2(s, y) \in Q] \cdot\mathop{\mathbb P}[S=s]\cdot\mathop{\mathbb P}[A_1(S)=y]) + \beta_1 + \beta_2\ \ \ \ \ (13)$

$\displaystyle = 2^{k_1+k_2}\cdot\mathop{\mathbb P}[S\times A_2(S,A_1(S)) \in Q] + \beta_1 + \beta_2\ \ \ \ \ (14)$

which, by the definition of approximate max information, gives

$\displaystyle I_{\infty}^{\beta_1+\beta_2}(S; A_3(S))\le k_1+k_2\ \ \ \ \ (15)$

$\Box$

By iteratively applying this lemma to every algorithm in the sequence, we obtain the following theorem on the max information of adaptively composed algorithms:

Theorem 4 Suppose ${y_i}$ is recursively defined as ${y_i = A_i(S, y_1, . . . , y_{i-1})}$, where each ${A_i}$ has ${\beta_i}$-approximate max information ${k_i}$. Then the composition ${A_m}$ has (${\Sigma_i\beta_i}$)-approximate max information (${\Sigma_i k_i}$).

3. Bounds on Max Information via Description Length

Lemma 5 Let ${X, Y}$ be two random variables in ${\chi}$. If

$\displaystyle \mathop{\mathbb P}_{x\sim X}[{{\mathop{\mathbb P}[X=x]}\over{\mathop{\mathbb P}[Y=x]}}\ge 2^k]\le\beta\ \ \ \ \ (16)$

then

$\displaystyle D_{\infty}^{\beta}(X\|Y)\le k\ \ \ \ \ (17)$

Theorem 6 Let ${A: \chi^n \rightarrow \Upsilon}$ be an algorithm whose output takes at most ${|\Upsilon|}$ values, i.e., whose description length is ${log|\Upsilon|}$ bits. Then ${\forall \beta > 0}$,

$\displaystyle I_{\infty}^{\beta}(S;A(S))\le log(|\Upsilon|/\beta)\ \ \ \ \ (18)$

Proof: Call ${y \in \Upsilon}$ bad if ${\exists s \in \chi^n}$ such that

$\displaystyle {{\mathop{\mathbb P}[Y=y|S=s]}\over{\mathop{\mathbb P}[Y=y]}}\ge|\Upsilon|/\beta\ \ \ \ \ (19)$

Let ${R}$ be the set of all bad ${y}$'s. For a bad ${y}$, the definition above gives ${{{\mathop{\mathbb P}[Y=y]}\over{\mathop{\mathbb P}[Y=y|S=s]}}\le\beta/|\Upsilon|}$ for some ${s}$, and since ${\mathop{\mathbb P}[Y=y|S=s] \le 1}$ we get ${\mathop{\mathbb P}[Y=y]\le\beta/|\Upsilon|}$; summing over the at most ${|\Upsilon|}$ bad values gives ${\mathop{\mathbb P}[Y\in R]\le\beta}$. Let ${O=\chi^n \times R}$. Then

$\displaystyle \mathop{\mathbb P}[(S,Y)\in O]=\mathop{\mathbb P}[Y \in R]\le\beta\ \ \ \ \ (20)$

${\forall (s,y) \notin O}$,

$\displaystyle \mathop{\mathbb P}[Y=y|S=s]\le|\Upsilon|/\beta\cdot\mathop{\mathbb P}[Y=y]\ \ \ \ \ (21)$

therefore

$\displaystyle \mathop{\mathbb P}_{(s,y)\sim p(S,Y)}[{{\mathop{\mathbb P}[Y=y|S=s]}\over{\mathop{\mathbb P}[Y=y]}}\ge |\Upsilon|/\beta]\le\beta\ \ \ \ \ (22)$

and by 16,

$\displaystyle I_{\infty}^{\beta}(S; Y)\le log(|\Upsilon|/\beta)\ \ \ \ \ (23)$

$\Box$
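For a sense of scale (hypothetical numbers): an analysis that only ever reports one of ${2^{20}}$ possible outcomes, say a choice among about a million candidate models, has ${|\Upsilon|=2^{20}}$, and Theorem 6 gives a ${\beta}$-approximate max information of roughly 26.6 bits for ${\beta = 0.01}$.

```python
import math

# Hypothetical: the analysis reports one of |Y| = 2**20 possible outputs.
num_outputs, beta = 2**20, 0.01
print(math.log2(num_outputs / beta))  # Theorem 6 bound: log(|Y| / beta) ~ 26.6 bits
```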

4. Bounds on Max Information via Differential Privacy

We can show that (approximate) max information captures pure ${\epsilon}$-differentially private algorithms.

Theorem 7 Let ${A}$ be an ${\epsilon}$-differentially private algorithm. Then ${I_{\infty}(S; A(S)) \le log e \cdot \epsilon n}$.

Proof: Any two datasets ${S_1, S_2 \in \chi^n}$ differ in at most ${n}$ elements, so by applying the definition of differential privacy ${n}$ times (group privacy),

$\displaystyle \mathop{\mathbb P}[Y=y|S=S_1] \le e^{\epsilon n} \mathop{\mathbb P}[Y=y|S=S_2]\ \ \ \ \ (24)$

which, by the definition of max divergence, gives

$\displaystyle D_{\infty}(A(S_1)||A(S_2)) \le log e \cdot \epsilon n\ \ \ \ \ (25)$

Since ${\mathop{\mathbb P}[A(S)=y]}$ is an average of ${\mathop{\mathbb P}[A(S)=y|S=s']}$ over datasets ${s'}$, and each of these is within a factor ${e^{\epsilon n}}$ of ${\mathop{\mathbb P}[A(S)=y|S=s]}$, the ratio in 3 is at most ${e^{\epsilon n}}$ for every ${s}$ and ${y}$, and therefore

$\displaystyle I_{\infty}(S; A(S)) \le log e \cdot \epsilon n\ \ \ \ \ (26)$

$\Box$
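For concreteness, here is a minimal sketch (not taken from [6]) of the Laplace mechanism, the textbook example of a pure ${\epsilon}$-differentially private algorithm: it releases the mean of values in ${[0,1]}$ after adding Laplace noise scaled to the mean's sensitivity, and Theorem 7 then bounds its max information by ${log e \cdot \epsilon n}$ bits.

```python
import numpy as np

def private_mean(data, epsilon, rng):
    """Release the mean of values in [0, 1] under pure epsilon-differential privacy.

    Changing one of the n elements moves the mean by at most 1/n (its
    sensitivity), so Laplace noise of scale 1 / (n * epsilon) suffices.
    """
    n = len(data)
    return float(np.mean(data)) + rng.laplace(scale=1.0 / (n * epsilon))

rng = np.random.default_rng(0)
sample = rng.uniform(size=1000)   # n = 1000 data points in [0, 1]
epsilon = 0.1
print("noisy mean:", private_mean(sample, epsilon, rng))
# Theorem 7 bound on the max information of this release:
print("bound (bits):", epsilon * 1000 * np.log2(np.e))   # ~144 bits
```

Note that the pure bound ${log e \cdot \epsilon n}$ grows linearly in ${n}$ and is quite large even for small ${\epsilon}$; the next results tighten it when the dataset is an i.i.d. sample.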

By the standard composition property of differential privacy, an algorithm ${A_m}$ obtained by adaptively composing algorithms that are (${\epsilon_i}$, ${\beta_i}$)-differentially private is (${\sum\limits_{i=1}^{m}\epsilon_i}$, ${\sum\limits_{i=1}^{m}\beta_i}$)-differentially private. We can also prove an approximate max information bound for ${\epsilon}$-differentially private algorithms when the dataset is a fresh i.i.d. sample from a distribution. We first say that a function ${f}$ has sensitivity ${c > 0}$ if ${|f(S_1)-f(S_2)| \le c}$ whenever ${S_1}$ and ${S_2}$ differ in a single element. McDiarmid's inequality bounds how far such functions deviate from their expectation.

Lemma 8 (McDiarmid's Inequality) Let ${X_1, . . . , X_n}$ be independent random variables taking values in ${\chi}$. Let ${f: \chi^n \rightarrow {\mathbb R}}$ be a function of sensitivity ${c}$. Then ${\forall \alpha > 0}$, with ${\mu = \mathop{\mathbb E}[f(X_1, . . . , X_n)]}$,

$\displaystyle \mathop{\mathbb P}[f(X_1, . . . , X_n) - \mu \ge \alpha] \le exp\left({{-2\alpha^2}\over{n \cdot c^2}}\right)\ \ \ \ \ (27)$
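A quick way to get a feel for the inequality (a Monte Carlo sketch, not a proof): the sample mean of ${n}$ values in ${[0,1]}$ has sensitivity ${c = 1/n}$, so Lemma 8 predicts a tail of at most ${exp(-2\alpha^2 n)}$, which the simulation below compares against the empirical tail.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, trials = 100, 0.05, 50_000

# The sample mean of n uniform values in [0, 1] has sensitivity c = 1/n, so
# Lemma 8 gives P[mean - mu >= alpha] <= exp(-2 alpha^2 / (n c^2)) = exp(-2 alpha^2 n).
samples = rng.uniform(size=(trials, n))
deviations = samples.mean(axis=1) - 0.5
print("empirical tail :", np.mean(deviations >= alpha))   # roughly 0.04 here
print("McDiarmid bound:", np.exp(-2 * alpha**2 * n))      # exp(-0.5) ~ 0.61
```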

We can then say the following about the approximate bounds.

Theorem 9 Let ${A}$ be ${\epsilon}$-differentially private, and let ${S}$ consist of ${n}$ i.i.d. samples from a distribution ${P}$ over ${\chi}$, i.e. ${S \sim P^n}$. Then ${\forall \beta>0}$, ${I_{\infty}^{\beta}(S; A(S)) \le log e \cdot (\epsilon ^2 n/2 + \epsilon \sqrt{n\cdot ln(2/\beta)/2})}$ [6].

Proof: Fix ${y \in \Upsilon}$. By Jensen's inequality,

$\displaystyle \mathop{\mathbb E}_{s\sim P^n}[ln(\mathop{\mathbb P}[Y=y | S=s])]\le ln(\mathop{\mathbb E}_{s\sim P^n}[\mathop{\mathbb P}[Y=y | S=s]])= ln(\mathop{\mathbb P}[Y=y])\ \ \ \ \ (28)$

and by 24, we have

$\displaystyle \mathop{\mathbb P}[Y=y|S=S_1] \le e^{\epsilon n} \mathop{\mathbb P}[Y=y|S=S_2]\ \ \ \ \ (29)$

Let ${g(s) = ln({{\mathop{\mathbb P}[Y=y|S=s]}\over{\mathop{\mathbb P}[Y=y]}})}$. By the properties above, ${\mathop{\mathbb E}[g(S)] \le 0}$, and ${|g(S_1)-g(S_2)| \le \epsilon}$ whenever ${S_1}$ and ${S_2}$ differ in a single element, so ${g}$ has sensitivity ${\epsilon}$. By 27, this implies that ${\forall t>0}$,

$\displaystyle \mathop{\mathbb P}[g(s)\ge t]\le e^{-2t^2/(n\epsilon^2)}\ \ \ \ \ (30)$

Let ${t_i=\epsilon^2n/2 + \epsilon \sqrt{n\cdot ln(2^i/\beta)/2}}$ (this choice will make the bounds below work out). Also, let ${B_i=\{s \,|\, t_i < g(s) \le t_{i+1}\}}$ and ${B_y=\{s \,|\, g(s)>t_1\}=\cup_{i\ge 1}B_i}$. Then by 30,

$\displaystyle \mathop{\mathbb P}[g(S)>t_i]\le exp(-2(\epsilon\sqrt{n}/2+\sqrt{ln(2^i/\beta)/2})^2)\ \ \ \ \ (31)$

By Bayes' rule, ${\forall s\in B_i}$,

$\displaystyle {{\mathop{\mathbb P}[S=s,Y=y]}\over{\mathop{\mathbb P}[S=s]\mathop{\mathbb P}[Y=y]}}={{\mathop{\mathbb P}[Y=y|S=s]}\over{\mathop{\mathbb P}[Y=y]}}=e^{g(s)}\le e^{t_{i+1}}\ \ \ \ \ (32)$

Therefore,

$\displaystyle \mathop{\mathbb P}[S\in B_i|Y=y]=\sum\limits_{s\in B_i}\mathop{\mathbb P}[S=s|Y=y]\ \ \ \ \ (33)$

$\displaystyle \le e^{t_{i+1}}\cdot \sum\limits_{s\in B_i}\mathop{\mathbb P}[S=s]\ \ \ \ \ (34)$

$\displaystyle \le e^{t_{i+1}}\cdot \mathop{\mathbb P}[g(S)\ge t_i]\ \ \ \ \ (35)$

$\displaystyle = exp(\epsilon^2n/2+\epsilon\sqrt{n\cdot ln(2^{i+1}/\beta)/2} -2(\epsilon\sqrt{n}/2+\sqrt{ln(2^i/\beta)/2})^2)\ \ \ \ \ (36)$

$\displaystyle \le exp(-ln(2^i/\beta))=\beta/2^i\ \ \ \ \ (37)$

By geometric series,

$\displaystyle \mathop{\mathbb P}[S \in B_y|Y=y]\le\sum\limits_{i\ge 1} \beta/2^i\le \beta\ \ \ \ \ (38)$

For ${O=\{(s,y)\,|\,y\in \Upsilon, s\in B_y\}}$,

$\displaystyle \mathop{\mathbb P}[(S,Y)\in O]=\mathop{\mathbb P}[S\in B_Y]\le\beta\ \ \ \ \ (39)$

Since ${\forall (s,y) \notin O}$

$\displaystyle \mathop{\mathbb P}[S=s|Y=y]\le e^{t_1}\cdot\mathop{\mathbb P}[S=s]\ \ \ \ \ (40)$

we have

$\displaystyle \mathop{\mathbb P}_{(s,y)\sim p(S,Y)}[{{\mathop{\mathbb P}[S=s|Y=y]}\over{\mathop{\mathbb P}[S=s]}}\ge e^{t_1}]\le\beta\ \ \ \ \ (41)$

and therefore, by 16,

$\displaystyle I_{\infty}^{\beta}(S; Y)\le log(e^{t_1})=log e \cdot (\epsilon ^2 n/2 + \epsilon \sqrt{n\cdot ln(2/\beta)/2})\ \ \ \ \ (42)$

$\Box$
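It is worth comparing the two bounds numerically (illustrative parameters): Theorem 7 grows like ${\epsilon n}$, while Theorem 9 grows like ${\epsilon^2 n/2}$ plus a ${\sqrt{n}}$ term, which is far smaller when ${\epsilon}$ is small.

```python
import math

# Illustrative parameters only.
epsilon, n, beta = 0.1, 10_000, 0.05
log2e = math.log2(math.e)

pure_bound = log2e * epsilon * n                                   # Theorem 7
approx_bound = log2e * (epsilon**2 * n / 2
                        + epsilon * math.sqrt(n * math.log(2 / beta) / 2))
print("Theorem 7 bound (beta = 0):   ", round(pure_bound), "bits")    # ~1443
print("Theorem 9 bound (beta = 0.05):", round(approx_bound), "bits")  # ~92
```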

5. Thresholdout

Thresholdout receives a sequence of query functions ${\phi: \chi \rightarrow [0, 1]}$ and, for each one, attempts to output an estimate that is close to the true expectation ${P[\phi]}$ with high probability, even though the queries may be chosen adaptively. Here ${\varepsilon_{S}[\phi]={{1}\over{n}}\Sigma_{i=1}^{n}\phi(x_i)}$ denotes the empirical mean of ${\phi}$ on a dataset ${S}$ of size ${n}$. The algorithm is described below.

Algorithm Thresholdout(TrainingSet ${S_t}$, HoldoutSet ${S_h}$, threshold ${T}$, noise rate ${\sigma}$, budget ${B}$)

1. sample ${\gamma \sim Lap(2\cdot\sigma)}$
2. ${\hat{T} \leftarrow T + \gamma}$
3. for each query ${\phi}$:
    1. if ${B < 1}$ return ${\emptyset}$
    2. else
        1. sample ${\eta \sim Lap(4 \cdot \sigma)}$
        2. if ${|\varepsilon_{S_h}[\phi] - \varepsilon_{S_t}[\phi]| > \hat{T} + \eta}$
            1. sample ${\xi \sim Lap(\sigma), \gamma \sim Lap(2\cdot \sigma)}$
            2. ${B \leftarrow B-1}$
            3. ${\hat{T} \leftarrow T+\gamma}$
            4. return ${\varepsilon_{S_h}[\phi] + \xi}$
        3. else return ${\varepsilon_{S_t}[\phi]}$

The algorithm treats the training set ${S_t}$ as a proxy for the target distribution ${P}$. For each query it checks whether the discrepancy ${|\varepsilon_{S_h}[\phi] - \varepsilon_{S_t}[\phi]|}$ is below the noisy threshold ${\hat{T} + \eta}$, where ${\hat{T}}$ is a fixed base threshold ${T}$ (such as ${0.01}$) perturbed by Laplace noise and ${\eta}$ is fresh Laplace noise. If the discrepancy is below the threshold (meaning the training estimate is still representative), the algorithm outputs ${\varepsilon_{S_t}[\phi]}$. Otherwise, it decrements the budget ${B}$, refreshes the noisy threshold, and outputs ${\varepsilon_{S_h}[\phi] + \xi}$. The algorithm continues until the budget runs out.
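Below is a minimal Python sketch of Thresholdout as listed above (function names, default parameters, and the toy queries are illustrative, not from [6]).

```python
import numpy as np

def thresholdout(train, holdout, queries, T=0.04, sigma=0.01, budget=100, seed=0):
    """Answer queries phi: x -> [0, 1] while protecting the holdout set.

    Follows the pseudocode above: return the training estimate unless it
    disagrees with the holdout estimate by more than a noisy threshold, in
    which case return a noisy holdout estimate and spend one unit of budget.
    """
    rng = np.random.default_rng(seed)
    T_hat = T + rng.laplace(scale=2 * sigma)
    B = budget
    answers = []
    for phi in queries:
        if B < 1:
            return answers                      # budget exhausted
        eta = rng.laplace(scale=4 * sigma)
        emp_train = float(np.mean(phi(train)))
        emp_hold = float(np.mean(phi(holdout)))
        if abs(emp_hold - emp_train) > T_hat + eta:
            xi = rng.laplace(scale=sigma)
            B -= 1
            T_hat = T + rng.laplace(scale=2 * sigma)
            answers.append(emp_hold + xi)       # noisy holdout answer
        else:
            answers.append(emp_train)           # training answer stands
    return answers

# Toy usage: threshold queries on scalar data (purely illustrative).
rng = np.random.default_rng(1)
train, holdout = rng.normal(size=5000), rng.normal(size=5000)
queries = [lambda x, t=t: (x > t).astype(float) for t in np.linspace(-1, 1, 20)]
print(thresholdout(train, holdout, queries)[:5])
```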

6. References