Generalization in Adaptive Data Analysis and Holdout Reuse

1. Overview

When training a machine learning model to learn some function {f}, the standard practice is to use a training set to train the parameters of {f}, and then use a holdout set to tune the hyperparameters of the learning algorithm {A} which outputs a function {f: S \in \chi \rightarrow \Upsilon}, such that {f} may generalize well. However, the problem is that when training, the user will often adjust the algorithm {A_m} depending on the analysis of the results from the previous algorithm {A_{m-1}}. The series of dependent algorithm steps ({A_1, A_2, . . . , A_{m-1}, A_{m}}) are known as algorithms that are composed adaptively. Unfortunately, when the same holdout set is used to train the hyperparameters over and over again, the model tends to produce a function {f} which might overfit on the holdout set. In order to combat this issue, we can first use a measure of generalization guarantee called Approximate Max Information[6], which says that given adaptively composed algorithms ({A_{m-1}, A_{m}}), {A_m} generalizes for all but {\beta} (percentage) of the holdout set. We then show how description length and differential privacy are two ways that we can use to guarantee generalization, as well as describe an algorithm called Thresholdout[6], in which we are able to select over the holdout set in order to guarantee generalization.

2. Approximate Max Information

Consider the random variables {P} and {Q}. We define the max divergence to be

\displaystyle D_{\infty}(P\|Q)=log \max_{x \in \chi} {{\mathop{\mathbb P}[P=x]}\over{\mathop{\mathbb P}[Q=x]}} \ \ \ \ \ (1)

and the {\beta}-approximate max divergence to be

\displaystyle D_{\infty}^{\beta} (P\|Q)=log \max_{O \subseteq \chi, \mathop{\mathbb P}[P \in O] > \beta} {{\mathop{\mathbb P}[P \in O]-\beta}\over{\mathop{\mathbb P}[Q \in O]}} \ \ \ \ \ (2)

Definition 1 The max information between two random variables X and Y is

\displaystyle   I_{\infty}(X; Y) = D_{\infty}((X,Y) \| X \times Y) = log \max_{x,y \in \chi} {{\mathop{\mathbb P}[X=x, Y=y]}\over{\mathop{\mathbb P}[X=x]\mathop{\mathbb P}[Y=y]}}\ \ \ \ \ (3)

and the {\beta}-approximate max information is

\displaystyle I_{\infty}^{\beta}(X; Y) = D_{\infty}^{\beta}((X,Y) \| X \times Y)\ \ \ \ \ (4)

where {X \times Y} represents choosing from {X} and {Y} indepedently.

In our case, {X} is {S \in {\mathbb R}^n}, and {Y} is {A(S)}. We can say that {A} has {\beta}-approximate max information of {k} if {I_{\infty}^{\beta}(S; A(S)) \le k}, where {k} is the number of bits that can descibe the algorithm, and therefore {2^k} values.

2.1. Generalization and Composition

Given the definition of approximate max information, we can find the probability that a bad event will occur.

Theorem 2 Let {S \in \chi^n} and {A: \chi^n \rightarrow \Upsilon} where {I_{\infty}^{\beta}(S; A(S))=k}. Then for {O \subseteq \chi^n \times \Upsilon},

\displaystyle  \mathop{\mathbb P}[(S, A(S)) \in O] \le 2^k \cdot \mathop{\mathbb P}[S \times A(S) \in O] + \beta \ \ \ \ \ (5)

We can use this to find the amount of maximum information of when algorithms are composed adaptively between one another.

Lemma 3 Let {A_1: \chi^n \rightarrow \Upsilon_1} such that {I_{\infty}^{\beta1}(S, A_1(S)) \le k_1} and {A_2: \chi^n \times \Upsilon_1 \rightarrow \Upsilon_2} such that {I_{\infty}^{\beta2}(S, A_2(\cdot, y)) \le k_2}. Then if {A_3: \chi^n \rightarrow \Upsilon_2} is {A_3(S)=A_2(S, A_1(S))}, the {\beta}-approximate max information will be {I_{\infty}^{\beta_1 + \beta_2}(S, A_3(S)) \le k_1 + k_2}.

Proof: Let {D} be a distribution over {\chi^n} and {S \sim D}. {\forall O \subseteq \chi^n \times \Upsilon},

\displaystyle  \mathop{\mathbb P}[(S, A_1(S)) \in O] \le 2^{k_1} \cdot \mathop{\mathbb P}[S \times A_1(S) \in O] + \beta_1 \ \ \ \ \ (6)

{\forall Q \subseteq \chi^n \times \Upsilon_2} and {\forall y \in \Upsilon_1}:

\displaystyle  \mathop{\mathbb P}[(S, A_2(S)) \in Q] \le 2^{k_2} \cdot \mathop{\mathbb P}[S \times A_2(S, y) \in Q] + \beta_2 \ \ \ \ \ (7)

define

\displaystyle  \mu(O)=\mathop{\mathbb P}[(S, A_1(S)) \in O] - 2^{k_1}\cdot \mathop{\mathbb P}[S \times A_1(S) \in O] \le \beta_1 \ \ \ \ \ (8)

we then arrive at

\displaystyle  \mathop{\mathbb P}[(S, A_3(S)) \in Q] = \mathop{\mathbb P}[(S, A_2(S, A_1(S))) \in Q]\ \ \ \ \ (9)

\displaystyle  = \sum\limits_{s\in \chi^n,y\in\Upsilon_1}\mathop{\mathbb P}[(s, A_2(s, y)) \in Q]\cdot\mathop{\mathbb P}[S=s, A_1(S)=y]\ \ \ \ \ (10)

\displaystyle  \le \sum\limits_{s\in \chi^n,y\in\Upsilon_1}min((2^{k_2}\mathop{\mathbb P}[s \times A_2(s, y) \in Q] + \beta_2), 1) \\ \cdot(2^{k_1}\cdot\mathop{\mathbb P}[S=s]\cdot\mathop{\mathbb P}[A_1(S)=y]+\mu(s, y))\ \ \ \ \ (11)

note that {\sum\limits_{s\in \chi^n,y\in\Upsilon_1}\mu(s, y) + \beta_2 \le \beta_1 + \beta_2}:

\displaystyle  \le \sum\limits_{s\in \chi^n,y\in\Upsilon_1}min(2^{k_2}\mathop{\mathbb P}[s \times A_2(s, y) \in Q], 1) \\ \cdot(2^{k_1}\cdot\mathop{\mathbb P}[S=s]\cdot\mathop{\mathbb P}[A_1(S)=y]) + \beta_1 + \beta_2\ \ \ \ \ (12)

\displaystyle  \le 2^{k_1+k_2}\cdot(\sum\limits_{s\in \chi^n,y\in\Upsilon_1} \mathop{\mathbb P}[s \times A_2(s, y) \in Q] \cdot\mathop{\mathbb P}[S=s]\cdot\mathop{\mathbb P}[A_1(S)=y]) + \beta_1 + \beta_2\ \ \ \ \ (13)

\displaystyle  = 2^{k_1+k_2}\cdot\mathop{\mathbb P}[S\times A_2(S,A_1(S)) \in Q] + \beta_1 + \beta_2\ \ \ \ \ (14)

by definition of max information,

\displaystyle \Rightarrow I_{\infty}^{\beta_1+\beta_2}\le k_1+k_2\ \ \ \ \ (15)

\Box

By iteratively applying this lemma for every algorithm, we find the following theorem for max information on adaptively composed algorithms:

Theorem 4 If {y_i} is recursively defined as {A_i(S, y_1, . . . , y_n)} such that {A_i} has {\beta}-approximate max information {k_i}, then {A_m} has ({\Sigma_i\beta_i})-approximate max information ({\Sigma_i k_i}).

3. Bounds on Max Information on Description Length

Lemma 5 Let {X, Y} be two random variables in {\chi}. If

\displaystyle \mathop{\mathbb P}_{x\sim p(\chi)}[{{\mathop{\mathbb P}[X=x]}\over{\mathop{\mathbb P}[Y=x]}}\ge 2^k]\le\beta\ \ \ \ \ (16)

then

\displaystyle D_{\infty}^{\beta}(X\|Y)\le k\ \ \ \ \ (17)

Theorem 6 Given a description length {k}, let {A: \chi^n \rightarrow \Upsilon}. {\forall \beta > 0},

\displaystyle  I_{\infty}^{\beta}(S,A(S))\le log(|\Upsilon|/\beta)\ \ \ \ \ (18)

Proof: For {y \in Y}, {y} is bad if {\exists s \in S} such that

\displaystyle {{\mathop{\mathbb P}[Y=y|S=s]}\over{\mathop{\mathbb P}[Y=y]}}\ge|\Upsilon|/\beta\ \ \ \ \ (19)

Let {R_y} be the set of all bad {y}‘s. Given the definition above, we note that for a bad y, {{{\mathop{\mathbb P}[Y=y]}\over{\mathop{\mathbb P}[Y=y|S=s]}}\le\beta/\Upsilon}, and therefore {\mathop{\mathbb P}[Y\in R_y\le\beta}. Let {O=\chi^n \times R_y}. Then

\displaystyle \mathop{\mathbb P}[(S,Y)\in O]=\mathop{\mathbb P}[Y \in O]=\beta\ \ \ \ \ (20)

{\forall (s,y) \notin O},

\displaystyle P[Y=y|S=s]\le|\Upsilon|/\beta\cdot\mathop{\mathbb P}[Y=y]\ \ \ \ \ (21)

therefore

\displaystyle \mathop{\mathbb P}_{(s,y)\sim p(S,Y)}[{{P[Y=y|S=s}\over{P[Y=y]}}\ge |\Upsilon|/\beta]\le\beta\ \ \ \ \ (22)

and by 16,

\displaystyle I_{\infty}^{\beta}(S; Y)\le log(|\Upsilon|/\beta)\ \ \ \ \ (23)

\Box

4. Bounds on Max Information on Differential Privacy

We can show that approximate max-information captures purely {\epsilon}-differentially private algorithms.

Theorem 7 Let {A} be an {\epsilon}-differentially private algorithm. Then {I_{\infty}(S, A(S)) \le log e \cdot \epsilon n}.

Proof: If we have two datasets {S_1} and {S_2}, they differ in at most {n} elements. By the definition of differential privacy,

\displaystyle   \mathop{\mathbb P}[Y=y|S=S_1] \le e^{\epsilon n} \mathop{\mathbb P}[Y=y|S=S_2]\ \ \ \ \ (24)

which, by the definition of divergence is

\displaystyle D_{\infty}(A(S_1)||A(S_2)) \le log e \cdot \epsilon n\ \ \ \ \ (25)

and by 3, we get

\displaystyle  I_{\infty}(S_1; A(S_1)) \le log e \cdot \epsilon n\ \ \ \ \ (26)

\Box

A differentially private algorithm that is ({\epsilon}, {\beta})-differentially private for all of its parameters is a ({\epsilon}, {\beta})- differentially private algorithm. By 26 an algorithm {A_m} obtained by composing ({\epsilon}, {\beta})-algorithms together is ({\sum\limits_{i=1}^{m}\epsilon_i}, {\sum\limits_{i=1}^{m}\beta_i})-differentially private. We can prove the approximate bounds for such an algorithm for datasets that pick a new i.i.d. sample every turn. We first say that functions that have low c-sensitivity are functions {f} such that {f(S_1)-f(S_2) \le c}, where {S_1} and {S_2} differ by one element and {c} is small and {c > 0}. We then define McDiarmid’s inequality for functions of low c-sensitivity.

Lemma 8 (McDiarmid’s Inequality) Let {X_1, . . . , X_n} be independent random variables from {\chi}. Let {f: \chi^n \rightarrow {\mathbb R}} be a function of low c-sensitivity. {\forall \alpha > 0} and {\mu = \mathop{\mathbb E}[f(X_1, . . . , X_n)]},

\displaystyle   \mathop{\mathbb P}[f(X_1, . . . , X_n) - \mu \ge \alpha] \le exp{{{-2\alpha^2}\over{n \cdot c^2}}}\ \ \ \ \ (27)

We can then say the following about the approximate bounds.

Theorem 9 Let {A} be differentially private. Let {S} be randomly sampled over the distribution {P^n} from {\chi^n}. Then {\forall \beta>0}, {I_{\infty}^{\beta}(S; A(S)) \le log \cdot e (\epsilon ^2 n/2 + \epsilon \sqrt{n\cdot ln(2/\beta)/2})} [6].

Proof: Fix {y \in Y}. By Jensen’s inequality,

\displaystyle \mathop{\mathbb E}_{s\sim P^n}[ln(\mathop{\mathbb P}[Y=y, S=s])]\le ln(\mathop{\mathbb E}_{s\sim P^n}[\mathop{\mathbb P}[Y=y, S=s]])= ln(\mathop{\mathbb P}[Y=y])\ \ \ \ \ (28)

and by 24, we have

\displaystyle \mathop{\mathbb P}[Y=y|S=S_1] \le e^{\epsilon n} \mathop{\mathbb P}[Y=y|S=S_2]\ \ \ \ \ (29)

Let {g(s) = ln({{P[Y=y|S=s]}\over{P[Y=y]}})}. By the properties above, {\mathop{\mathbb E}[g(S)] \le 0} and {|g(S_1)-g(S_2)\le \epsilon}, which by 27, implies that {\forall t>0},

\displaystyle \mathop{\mathbb P}[g(s)\ge t]\le e^{-2t^2/(n\epsilon^2)}\ \ \ \ \ (30)

Let {t_i=\epsilon^2n/2 + \epsilon \sqrt{n\cdot ln(2^i/\beta)/2}} (this will help with our bounds later on). Also, let {B_i=\{s|t_it_1\}=\cup_{i\ge 1}B_i}. Then by 30,

\displaystyle \mathop{\mathbb P}[g(S)>t_i]\le exp(-2(\epsilon\sqrt{n}/2+\sqrt{ln(2^i/\beta)/2})^2)\ \ \ \ \ (31)

. By Bayes’ Rule, {\forall s\in B_i},

\displaystyle {{\mathop{\mathbb P}[S=s,Y=y]}\over{\mathop{\mathbb P}[S=s]\mathop{\mathbb P}[Y=y]}}={{\mathop{\mathbb P}[Y=y|S=s]}\over{\mathop{\mathbb P}[Y=y]}}=e^{g(s)}\le e^{t_{i+1}}\ \ \ \ \ (32)

Therefore,

\displaystyle \mathop{\mathbb P}[S\in B_i|Y=y]=\sum\limits_{S\in B_i}\mathop{\mathbb P}[S=s|Y=y]\ \ \ \ \ (33)

\displaystyle \le e^{t_{i+1}}\cdot \sum\limits_{S\in B_i}\mathop{\mathbb P}[S=s]\ \ \ \ \ (34)

\displaystyle \le e^{t_{i+1}}\cdot \mathop{\mathbb P}[g(S)\ge t_i]\ \ \ \ \ (35)

\displaystyle = exp(\epsilon^2n/2+\epsilon\sqrt{n\cdot ln(2^{i+1}/\beta)/2} -2(\epsilon\sqrt{n}/2+\sqrt{ln(2^i/\beta)/2})^2)\ \ \ \ \ (36)

\displaystyle \le exp(-ln(2^i/\beta))=\beta/2^i\ \ \ \ \ (37)

By geometric series,

\displaystyle \mathop{\mathbb P}[S \in B_y|Y=y]\le\sum\limits_{i\ge 1}^{} \beta/2^i\le \beta\ \ \ \ \ (38)

For {O={(s,y)|y\in Y,s\in B_y}},

\displaystyle \mathop{\mathbb P}[(S,Y)\in O]=\mathop{\mathbb P}[(S,Y)\in B_Y]\le\beta\ \ \ \ \ (39)

Since {\forall (s,y) \notin O}

\displaystyle \mathop{\mathbb P}[S=s|Y=y]\le e^{t_1}\cdot\mathop{\mathbb P}[S=s]\ \ \ \ \ (40)

Then

\displaystyle \mathop{\mathbb P}_{(s,y)\sim p(S,Y)}[{{\mathop{\mathbb P}[S=s|Y=y]}\over{\mathop{\mathbb P}[S=s]}}\ge e^{t_1}]\le\beta\ \ \ \ \ (41)

and therefore, by 16,

\displaystyle I_{\infty}^{\beta}\le log(e^{t_1})=log \cdot e (\epsilon ^2 n/2 + \epsilon \sqrt{n\cdot ln(2/\beta)/2})\ \ \ \ \ (42)

\Box

5. Thresholdout

In Thresholdout, the algorithm attempts to provide a function {\phi: \chi \rightarrow [0, 1]} such that {\varepsilon_{S_h}[\phi]-P[\phi]} is small within high probability, where {\varepsilon_{S}(\phi)={{1}\over{n}}\Sigma_{i=1}^{n}\phi(x_i)}. The algorithm is described below.

Algorithm Thresholdout(TrainingSet {S_t}, HoldoutSet {S_h}, threshold {T}, noiserate {\sigma}, budget {B})

  1. sample {\gamma \sim Lap(2\cdot\sigma)}
  2. {\hat{T} \leftarrow T + \gamma}
  3. for each query {\phi}:
    1. if {B < 1} return {\emptyset}
    2. else
      1. sample {\eta \sim Lap(4 \cdot \sigma)}
      2. if {|\varepsilon_{S_h}[\phi] - \varepsilon_{S_t}[\phi]| > \hat{T} + \eta}
        1. sample {\xi \sim Lap(\sigma), \gamma \sim Lap(2\cdot \sigma)}
        2. {B \leftarrow B-1}
        3. {\hat{T} \leftarrow T+\gamma}
        4. return {\varepsilon_{S_h}[\phi] + \xi}
      3. else return {\varepsilon_{S_t}[\phi]}

The algorithm tries to determine the target distribution {P} by using the training set {S_t} as a model. It first checks the distance {|\varepsilon_{S_h}[\phi] - \varepsilon_{S_t}[\phi]|} to see if it is below the threshold {T + \eta}, where {T} is a fixed number such as {0.01}, and {\eta} is a Laplacian noise variable. If the difference is below the threshold (meaning it is good), then the algorithm will output {\varepsilon_{S_t}[\phi]}. Otherwise, it will decrement the budget {B} and output {\varepsilon_{S_h}[\phi] + \xi}. The algorithm continues until the budget runs out.

6. References

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s