**1. Overview**

When training a machine learning model to learn some function $f$, the standard practice is to use a training set to fit the parameters of $f$, and then use a holdout set to tune the hyperparameters of the learning algorithm $\mathcal{A}$ which outputs a function $f$, so that $f$ may generalize well. The problem is that, in practice, the user often adjusts the current algorithm $\mathcal{A}_i$ based on an analysis of the results of the previous algorithm $\mathcal{A}_{i-1}$. Such a series of dependent algorithm steps $(\mathcal{A}_1, \dots, \mathcal{A}_k)$ is said to be *composed adaptively*. Unfortunately, when the same holdout set is used to tune the hyperparameters over and over again, the model tends to produce a function which overfits the holdout set. To combat this issue, we first introduce a measure that yields generalization guarantees, called **approximate max information** [6]: if the adaptively composed algorithms $(\mathcal{A}_1, \dots, \mathcal{A}_k)$ have bounded approximate max information, then their output generalizes except with small probability over the holdout set. We then show that bounded description length and differential privacy are two ways to guarantee low max information, and describe an algorithm called **Thresholdout** [6], which lets us select over the holdout set repeatedly while still guaranteeing generalization.
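The overfitting phenomenon described above is easy to reproduce in simulation. The sketch below (all sizes and thresholds are illustrative choices, not values from [6]) plays the role of an adaptive analyst who selects features by their correlation with the labels on a reused holdout set of pure noise; the reused holdout reports high accuracy while fresh data reveals chance performance:

```python
import random

random.seed(0)
n, d = 100, 400  # illustrative sizes: n holdout points, d candidate features

# Labels and features are pure noise, so NO selection rule truly generalizes.
X_hold = [[random.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]
y_hold = [random.choice((-1.0, 1.0)) for _ in range(n)]

# "Adaptive analyst": keep every feature whose holdout correlation looks
# strong, then vote the selected features -- reusing the SAME holdout set.
corr = [sum(X_hold[i][j] * y_hold[i] for i in range(n)) / n for j in range(d)]
selected = [j for j in range(d) if abs(corr[j]) > 0.2]

def predict(row):
    vote = sum((1.0 if corr[j] > 0 else -1.0) * row[j] for j in selected)
    return 1.0 if vote >= 0 else -1.0

holdout_acc = sum(predict(X_hold[i]) == y_hold[i] for i in range(n)) / n

# A fresh sample from the same (noise) distribution reveals chance accuracy.
X_new = [[random.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]
y_new = [random.choice((-1.0, 1.0)) for _ in range(n)]
fresh_acc = sum(predict(X_new[i]) == y_new[i] for i in range(n)) / n
```

The gap between `holdout_acc` and `fresh_acc` is exactly the overfitting that the machinery below is designed to bound.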

**2. Approximate Max Information**

Consider two random variables $X$ and $Y$ over the same domain. We define the max divergence to be

$$D_{\infty}(X \,\|\, Y) = \log \max_{x} \frac{\Pr[X = x]}{\Pr[Y = x]},$$

and the $\beta$-approximate max divergence to be

$$D_{\infty}^{\beta}(X \,\|\, Y) = \log \max_{O \,:\, \Pr[X \in O] > \beta} \frac{\Pr[X \in O] - \beta}{\Pr[Y \in O]}.$$

**Definition 1** The max information between two random variables $X$ and $Y$ is

$$I_{\infty}(X; Y) = D_{\infty}\big((X, Y) \,\|\, X \otimes Y\big),$$

where $X \otimes Y$ represents choosing $X$ and $Y$ independently from their marginal distributions. Analogously, the $\beta$-approximate max information is $I_{\infty}^{\beta}(X; Y) = D_{\infty}^{\beta}\big((X, Y) \,\|\, X \otimes Y\big)$.

In our case, $X$ is the dataset $S$, and $Y$ is the output $\mathcal{A}(S)$. We say that $\mathcal{A}$ has $\beta$-approximate max information of $k$ if $I_{\infty}^{\beta}(S; \mathcal{A}(S)) \le k$ for every distribution on $S$, where $k$ is the number of bits needed to describe the output of the algorithm, which can therefore take at most $2^k$ values.
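As a quick numeric illustration of Definition 1, the following sketch computes the max information of a toy two-by-two joint distribution directly from the definition (both tables are arbitrary examples):

```python
import math

# p[x][y] = Pr[(X, Y) = (x, y)] for a correlated toy pair.
p = [[0.3, 0.1],
     [0.1, 0.5]]

px = [sum(row) for row in p]                              # marginal of X
py = [sum(p[x][y] for x in range(2)) for y in range(2)]   # marginal of Y

# I_infty(X; Y) = log2 of the max over (x, y) of p(x, y) / (p(x) * p(y)).
I_max = math.log2(max(p[x][y] / (px[x] * py[y])
                      for x in range(2) for y in range(2)))

# A rank-one table (X and Y independent) gives ratio 1, i.e. zero bits.
q = [[0.2, 0.2],
     [0.3, 0.3]]
qx = [sum(row) for row in q]
qy = [sum(q[x][y] for x in range(2)) for y in range(2)]
I_indep = math.log2(max(q[x][y] / (qx[x] * qy[y])
                        for x in range(2) for y in range(2)))
```

Here `I_max` is about 0.9 bits, while the independent pair carries zero max information, as expected.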

**2.1. Generalization and Composition**

Given the definition of approximate max information, we can bound the probability that a bad event occurs.

**Theorem 2** Let $X$ and $Y$ be random variables where $I_{\infty}^{\beta}(X; Y) = k$. Then for any event $O$,

$$\Pr[(X, Y) \in O] \le 2^{k} \cdot \Pr[X \otimes Y \in O] + \beta.$$

We can use this to bound the max information of algorithms that are composed adaptively.

**Lemma 3** Let $\mathcal{A} : \mathcal{X} \to \mathcal{Y}$ be such that $I_{\infty}^{\beta_1}(X; \mathcal{A}(X)) \le k_1$, and $\mathcal{B} : \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$ be such that for every fixed $y \in \mathcal{Y}$, $I_{\infty}^{\beta_2}(X; \mathcal{B}(X, y)) \le k_2$. Then if $\mathcal{C}$ is the adaptive composition $\mathcal{C}(X) = (\mathcal{A}(X), \mathcal{B}(X, \mathcal{A}(X)))$, the $(\beta_1 + \beta_2)$-approximate max information of $\mathcal{C}$ will be at most $k_1 + k_2$.

*Proof sketch:* Consider first the case $\beta_1 = \beta_2 = 0$. For any outcome $(x, y, z)$ of $(X, \mathcal{A}(X), \mathcal{B}(X, \mathcal{A}(X)))$, the joint probability factors as

$$\Pr[X = x] \cdot \Pr[\mathcal{A}(X) = y \mid X = x] \cdot \Pr[\mathcal{B}(X, y) = z \mid X = x].$$

The max-information bound on $\mathcal{A}$ says that conditioning on $X = x$ shifts the distribution of $\mathcal{A}(X)$ by a factor of at most $2^{k_1}$ pointwise, and the bound on $\mathcal{B}$ (applied with the first-stage output $y$ held fixed) says that conditioning on $X = x$ shifts the distribution of $\mathcal{B}(X, y)$ by a factor of at most $2^{k_2}$. Multiplying the two bounds, the joint distribution of $(X, \mathcal{C}(X))$ exceeds the product of its marginals by a factor of at most $2^{k_1 + k_2}$, which by the definition of max information gives $I_{\infty}(X; \mathcal{C}(X)) \le k_1 + k_2$. For $\beta_1, \beta_2 > 0$, the same argument applies outside two bad events of probability at most $\beta_1$ and $\beta_2$ respectively, and a union bound makes the total failure probability at most $\beta_1 + \beta_2$. $\square$

By iteratively applying this lemma to each algorithm in the chain, we obtain the following theorem on the max information of adaptively composed algorithms:

**Theorem 4** If $\mathcal{A} = (\mathcal{A}_1, \dots, \mathcal{A}_k)$ is recursively defined such that each $\mathcal{A}_i$ receives the dataset together with the outputs of $\mathcal{A}_1, \dots, \mathcal{A}_{i-1}$, and each $\mathcal{A}_i$ has $\beta_i$-approximate max information $k_i$, then $\mathcal{A}$ has $\left(\sum_{i=1}^{k} \beta_i\right)$-approximate max information $\left(\sum_{i=1}^{k} k_i\right)$.
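The bookkeeping in Theorems 2 and 4 can be sketched as two small helpers; the step counts, bit budgets, and event probability in the demo are hypothetical numbers, not values from [6]:

```python
import math

def adaptive_budget(infos, betas):
    """Theorem 4 (composition): per-step budgets simply add.

    Steps with max-information k_i and failure beta_i compose into a
    (sum beta_i)-approximate max-information of (sum k_i)."""
    return sum(infos), sum(betas)

def bad_event_bound(p_indep, k_bits, beta):
    """Theorem 2: Pr[(X, Y) in O] <= 2^k * Pr[X (x) Y in O] + beta."""
    return min(1.0, 2.0 ** k_bits * p_indep + beta)

# Hypothetical numbers: 5 adaptive steps, each leaking 0.5 bits with beta = 1e-3.
k_total, beta_total = adaptive_budget([0.5] * 5, [1e-3] * 5)

# An event with probability 1e-4 under independence stays unlikely adaptively.
p_adaptive = bad_event_bound(1e-4, k_total, beta_total)
```

The point of the exponential factor $2^k$ is that a few bits of leakage per step are harmless, but the budget is consumed additively across all adaptive steps.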

**3. Bounds on Max Information from Description Length**

**Lemma 5** Let $X, Y$ be two random variables over domains $\mathcal{X}, \mathcal{Y}$. If

$$\Pr_{(x, y) \sim (X, Y)}\left[\log \frac{\Pr[(X, Y) = (x, y)]}{\Pr[X = x] \cdot \Pr[Y = y]} \ge k\right] \le \beta,$$

then $I_{\infty}^{\beta}(X; Y) \le k$.

**Theorem 6** Given a description length of $k$ bits, let $\mathcal{A}$ be a randomized algorithm whose range $\mathcal{Y}$ satisfies $|\mathcal{Y}| \le 2^k$. Then for every $\beta > 0$,

$$I_{\infty}^{\beta}(S; \mathcal{A}(S)) \le \log \frac{2^k}{\beta} = k + \log \frac{1}{\beta}.$$

*Proof:* For $y \in \mathcal{Y}$, say that $y$ is *bad* if $\Pr[\mathcal{A}(S) = y] < \beta / 2^k$. Let $B$ be the set of all bad $y$'s. Given the definition above, since $|\mathcal{Y}| \le 2^k$, we note that $\Pr[\mathcal{A}(S) \in B] \le 2^k \cdot \beta / 2^k = \beta$. For a good $y \notin B$ and any $x$,

$$\frac{\Pr[(S, \mathcal{A}(S)) = (x, y)]}{\Pr[S = x] \cdot \Pr[\mathcal{A}(S) = y]} = \frac{\Pr[\mathcal{A}(S) = y \mid S = x]}{\Pr[\mathcal{A}(S) = y]} \le \frac{1}{\beta / 2^k} = \frac{2^k}{\beta},$$

therefore the log-ratio exceeds $\log(2^k / \beta)$ only on the bad event, which has probability at most $\beta$, and by Lemma 5, $I_{\infty}^{\beta}(S; \mathcal{A}(S)) \le \log(2^k / \beta)$. $\square$
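Assuming the bound $k + \log(1/\beta)$ stated in Theorem 6, a description-length budget is a one-line computation; the parameters below are hypothetical:

```python
import math

def description_length_bound(k_bits, beta):
    """Theorem 6: an algorithm whose output fits in k bits (at most 2^k
    distinct outputs) has beta-approximate max-information at most
    log2(2^k / beta) = k + log2(1 / beta)."""
    return k_bits + math.log2(1.0 / beta)

# Hypothetical: a selection procedure that returns one of 1024 candidate
# models (10 bits of output) with failure probability beta = 0.01.
b = description_length_bound(10, beta=0.01)
```

Note how weakly the bound depends on $\beta$: shrinking the failure probability by a factor of 100 costs only $\log_2 100 \approx 6.6$ extra bits.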

**4. Bounds on Max Information from Differential Privacy**

We can show that approximate max information captures purely $\varepsilon$-differentially private algorithms.

**Theorem 7** Let $\mathcal{A}$ be an $\varepsilon$-differentially private algorithm operating on datasets of size $n$. Then $I_{\infty}(S; \mathcal{A}(S)) \le \log e \cdot \varepsilon n$.

*Proof:* If we have two datasets $x$ and $x'$ of size $n$, they differ in at most $n$ elements. By the definition of differential privacy, applied once per substituted element,

$$\Pr[\mathcal{A}(x) = y] \le e^{\varepsilon n} \cdot \Pr[\mathcal{A}(x') = y] \quad \text{for all } y,$$

which implies $\Pr[\mathcal{A}(S) = y \mid S = x] \le e^{\varepsilon n} \cdot \Pr[\mathcal{A}(S) = y]$ for every $x$ and $y$, and by the definition of max divergence, we get

$$I_{\infty}(S; \mathcal{A}(S)) = D_{\infty}\big((S, \mathcal{A}(S)) \,\|\, S \otimes \mathcal{A}(S)\big) \le \log e^{\varepsilon n} = \log e \cdot \varepsilon n. \qquad \square$$

An algorithm that is $\varepsilon$-differentially private in each of its adaptive steps is, by the standard composition property of differential privacy, $k\varepsilon$-differentially private after $k$ steps. The bound of Theorem 7 grows linearly in $n$; we can prove a much stronger approximate bound when the dataset consists of $n$ i.i.d. samples from the underlying distribution. We first say that a function $f$ has low *$c$-sensitivity* if $|f(s) - f(s')| \le c$ whenever the datasets $s$ and $s'$ differ in one element, where $c$ is small (for example, $c = 1/n$ for an empirical mean of $n$ points in $[0, 1]$). We then state McDiarmid's inequality for functions of low $c$-sensitivity.

**Lemma 8 (McDiarmid's Inequality)** Let $X_1, \dots, X_n$ be independent random variables from domain $\mathcal{X}$, and let $f : \mathcal{X}^n \to \mathbb{R}$ have low $c$-sensitivity. Then for every $t > 0$,

$$\Pr\left[f(X_1, \dots, X_n) - \mathbb{E}[f(X_1, \dots, X_n)] \ge t\right] \le \exp\left(\frac{-2 t^2}{n c^2}\right).$$
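A quick empirical check of the inequality for the empirical mean, whose $c$-sensitivity is $1/n$ on $[0, 1]$-valued points (two-sided form; the sample sizes and deviation are illustrative):

```python
import math
import random

random.seed(0)
n = 200
c = 1.0 / n   # changing one point moves the mean of n points in [0,1] by <= 1/n
t = 0.1       # deviation from the expectation we ask about

# Two-sided McDiarmid: Pr[|f(S) - E f(S)| >= t] <= 2 exp(-2 t^2 / (n c^2)).
bound = 2 * math.exp(-2 * t * t / (n * c * c))

# Empirical frequency of that deviation for the mean of n Uniform[0,1] draws,
# whose expectation is exactly 0.5.
trials = 2000
hits = sum(
    abs(sum(random.random() for _ in range(n)) / n - 0.5) >= t
    for _ in range(trials)
)
freq = hits / trials
```

With these numbers the bound is about 0.037, while the observed frequency is essentially zero; McDiarmid is loose here but, crucially for the proof below, it decays exponentially in $t^2$.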

We can then state the following approximate max-information bound.

**Theorem 9** Let $\mathcal{A}$ be $\varepsilon$-differentially private, and let $S$ consist of $n$ samples drawn i.i.d. from a distribution $\mathcal{P}$. Then for every $\beta > 0$,

$$I_{\infty}^{\beta}(S; \mathcal{A}(S)) \le \log e \left(\frac{\varepsilon^2 n}{2} + \varepsilon \sqrt{\frac{n \ln(2/\beta)}{2}}\right)$$

[6].

*Proof sketch:* Fix an output $y$ and define the log-ratio

$$g(x) = \log \frac{\Pr[\mathcal{A}(S) = y \mid S = x]}{\Pr[\mathcal{A}(S) = y]};$$

to apply Lemma 5 we must show that $g(S)$ is rarely large. Two facts drive the argument. First, since changing one element of $x$ changes $\Pr[\mathcal{A}(x) = y]$ by a factor of at most $e^{\varepsilon}$, the function $g$ has low $c$-sensitivity with $c = O(\varepsilon)$, and a direct computation using Jensen's inequality (together with Bayes' rule, to pass between conditioning on $S$ and conditioning on $\mathcal{A}(S)$) bounds its expectation by $O(\varepsilon^2 n)$. Second, since $S$ consists of independent samples, McDiarmid's inequality (Lemma 8) shows that $g(S)$ exceeds its expectation by $t$ with probability at most $\exp(-2t^2 / (n c^2))$. Summing these tail bounds over the possible outputs $y$, via a geometric series over dyadic levels of $t$, shows that the log-ratio exceeds $\log e \left(\varepsilon^2 n / 2 + \varepsilon \sqrt{n \ln(2/\beta) / 2}\right)$ with probability at most $\beta$, and therefore, by Lemma 5, the claimed bound on $I_{\infty}^{\beta}(S; \mathcal{A}(S))$ follows. $\square$
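Assuming the i.i.d. bound $\log e \,(\varepsilon^2 n/2 + \varepsilon\sqrt{n \ln(2/\beta)/2})$ stated in [6] and the worst-case bound $\log e \cdot \varepsilon n$ of Theorem 7, the sketch below compares the two on hypothetical parameters; for small $\varepsilon$ the i.i.d. bound scales roughly like $\varepsilon\sqrt{n}$ rather than $\varepsilon n$:

```python
import math

def dp_max_info_iid(eps, n, beta):
    """Theorem 9's bound (in bits) for eps-DP algorithms on n i.i.d. samples:
    log2(e) * (eps^2 n / 2 + eps * sqrt(n * ln(2/beta) / 2))."""
    return math.log2(math.e) * (
        eps ** 2 * n / 2 + eps * math.sqrt(n * math.log(2 / beta) / 2)
    )

def dp_max_info_worst(eps, n):
    """Theorem 7's worst-case bound (in bits): log2(e) * eps * n."""
    return math.log2(math.e) * eps * n

# Hypothetical setting: eps = 0.1, n = 10,000 samples, beta = 1e-6.
b_iid = dp_max_info_iid(0.1, 10_000, beta=1e-6)
b_worst = dp_max_info_worst(0.1, 10_000)
```

Here the worst-case bound is well over a thousand bits while the i.i.d. bound is around a hundred, which is the gap that makes differential privacy a useful tool for holdout reuse.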

**5. Thresholdout**

In Thresholdout, the algorithm attempts to answer each query $\phi$ with a value $a_\phi$ such that $|a_\phi - \mathbb{E}_{x \sim \mathcal{P}}[\phi(x)]|$ is small with high probability, where $\mathcal{P}$ is the underlying data distribution. The algorithm is described below.

Algorithm Thresholdout(TrainingSet $S_t$, HoldoutSet $S_h$, threshold $T$, noiserate $\sigma$, budget $B$)

- sample $\gamma \sim \mathrm{Lap}(2\sigma)$
- for each query $\phi$:
  - if $B < 1$: return $\bot$
  - else
    - sample $\eta \sim \mathrm{Lap}(4\sigma)$
    - if $|\mathbb{E}_{S_h}[\phi] - \mathbb{E}_{S_t}[\phi]| > T + \gamma + \eta$:
      - sample $\xi \sim \mathrm{Lap}(\sigma)$, $\gamma \sim \mathrm{Lap}(2\sigma)$, and set $B \leftarrow B - 1$
      - return $\mathbb{E}_{S_h}[\phi] + \xi$
    - else return $\mathbb{E}_{S_t}[\phi]$

The algorithm uses the training set as a model of the target distribution. For each query $\phi$, it first checks whether the gap $|\mathbb{E}_{S_h}[\phi] - \mathbb{E}_{S_t}[\phi]|$ is below the noisy threshold $T + \gamma + \eta$, where $T$ is a small fixed number and $\gamma, \eta$ are Laplacian noise variables. If the gap is below the threshold (meaning the training set answers the query accurately), then the algorithm will output $\mathbb{E}_{S_t}[\phi]$. Otherwise, it will decrement the budget $B$ and output the noisy holdout estimate $\mathbb{E}_{S_h}[\phi] + \xi$. The algorithm continues until the budget runs out.
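The procedure above can be sketched in a few lines of code. The Laplace scales $(\sigma, 2\sigma, 4\sigma)$ and the budget handling follow the description here; the data, the queries, and all parameter values in the demo are illustrative assumptions:

```python
import random

def lap(scale, rng):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def mean(q, data):
    return sum(q(x) for x in data) / len(data)

def thresholdout(queries, train, holdout, T, sigma, budget):
    """Answer each query q (a [0,1]-valued function of one data point) with
    the training-set mean when train and holdout agree, and a noisy holdout
    mean otherwise, spending one unit of budget per disagreement."""
    rng = random.Random(0)
    gamma = lap(2 * sigma, rng)
    answers = []
    for q in queries:
        if budget < 1:
            answers.append(None)          # budget exhausted: refuse to answer
            continue
        eta = lap(4 * sigma, rng)
        if abs(mean(q, holdout) - mean(q, train)) > T + gamma + eta:
            budget -= 1                   # pay for touching the holdout
            gamma = lap(2 * sigma, rng)   # refresh the threshold noise
            answers.append(mean(q, holdout) + lap(sigma, rng))
        else:
            answers.append(mean(q, train))
    return answers

# Tiny demo on synthetic data drawn from one distribution, so the training
# and holdout means mostly agree and few budget units are spent.
rng = random.Random(42)
train = [rng.random() for _ in range(500)]
holdout = [rng.random() for _ in range(500)]
queries = [(lambda t: (lambda x: float(x > t)))(k / 10) for k in range(10)]
answers = thresholdout(queries, train, holdout, T=0.05, sigma=0.01, budget=5)
```

Because an analyst only learns whether the two sets disagree (plus noise), each answered query leaks very little information about the holdout set, which is what lets the generalization guarantees of the previous sections apply.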

**6. References**