When training a machine learning model to learn some function $f$, the standard practice is to use a training set to fit the parameters of $f$, and then use a holdout set to tune the hyperparameters of the learning algorithm $A$ which outputs a function $f$, so that $f$ may generalize well. The problem, however, is that in practice the user will often adjust the next algorithm $A_i$ depending on an analysis of the results of the previous algorithms $A_1, \ldots, A_{i-1}$. Such a series of dependent steps $A_1, \ldots, A_m$ is known as a sequence of algorithms that are composed adaptively. Unfortunately, when the same holdout set is used to tune the hyperparameters over and over again, the model tends to produce a function which overfits the holdout set. To combat this issue, we first introduce a measure of generalization called approximate max information, which guarantees that adaptively composed algorithms $A_1, \ldots, A_m$ generalize except with some small probability $\beta$ over the holdout set. We then show that bounded description length and differential privacy are two ways to guarantee bounded max information, and describe an algorithm called Thresholdout, which lets us reuse the holdout set for selection while still guaranteeing generalization.
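To see the problem concretely, the following toy experiment (a sketch with made-up sizes, not from the text: random $\pm 1$ features, labels that are pure noise, and a selection step that keeps features whose label correlation agrees in sign on both sets) shows how reusing the holdout set for selection yields a classifier that looks good on the holdout but not on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2000  # samples per set and number of features (hypothetical sizes)

def make_data():
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = rng.choice([-1.0, 1.0], size=n)  # labels independent of all features
    return X, y

Xtr, ytr = make_data()   # training set
Xho, yho = make_data()   # holdout set
Xfr, yfr = make_data()   # fresh data for an honest evaluation

# Adaptive step: keep features whose correlation with the label has the
# same sign on BOTH the training and holdout sets.
ctr = Xtr.T @ ytr / n
cho = Xho.T @ yho / n
selected = np.sign(ctr) == np.sign(cho)
w = np.where(selected, np.sign(ctr), 0.0)

def accuracy(X, y):
    return float(np.mean(np.sign(X @ w) == y))

holdout_acc, fresh_acc = accuracy(Xho, yho), accuracy(Xfr, yfr)
# Labels are noise, so the true accuracy of any classifier is 50%; the
# holdout accuracy is inflated because the holdout set chose the features.
```

Because the selection step consulted the holdout set, the holdout accuracy is far above 50% even though the fresh-data accuracy stays near chance.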
2. Approximate Max Information
Consider the random variables $X$ and $Y$. We define the max divergence to be
$$D_\infty(X \,\|\, Y) = \log_2 \max_{O \subseteq \mathrm{supp}(X)} \frac{\Pr[X \in O]}{\Pr[Y \in O]},$$
and the $\beta$-approximate max divergence to be
$$D_\infty^\beta(X \,\|\, Y) = \log_2 \max_{O:\ \Pr[X \in O] > \beta} \frac{\Pr[X \in O] - \beta}{\Pr[Y \in O]},$$
and the $\beta$-approximate max information is
$$I_\infty^\beta(X; Y) = D_\infty^\beta\big((X, Y) \,\|\, X \otimes Y\big),$$
where $X \otimes Y$ represents choosing from $X$ and $Y$ independently.
In our case, $X$ is the dataset $S$, and $Y$ is the output $A(S)$ of the algorithm. We can say that $A$ has $\beta$-approximate max information of $k$ if $I_\infty^\beta(S; A(S)) \le k$ for every distribution of $S$ over $n$-element datasets; intuitively, $k$ is the number of bits of information about $S$ that the algorithm's output can reveal.
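As a quick numerical illustration of these definitions (with a made-up joint distribution, not one from the text), the pure max information of a pair of correlated bits can be computed directly, using the fact that for max divergence the maximizing event can be taken to be a single point:

```python
import numpy as np

# Hypothetical joint distribution of (X, Y) over {0,1} x {0,1}.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px = joint.sum(axis=1)       # marginal of X
py = joint.sum(axis=0)       # marginal of Y
product = np.outer(px, py)   # law of X (x) Y: two independent draws

# D_inf((X,Y) || X (x) Y): the ratio Pr[O]/Pr'[O] over events O is
# maximized at a singleton, so a pointwise maximum suffices.
max_info_bits = float(np.max(np.log2(joint / product)))
```

Here `max_info_bits` is $\log_2(0.4/0.25) \approx 0.678$: observing $Y$ can multiply the probability of any event about $X$ by a factor of at most $2^{0.678} = 1.6$.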
2.1. Generalization and Composition
Given the definition of approximate max information, we can bound the probability that a bad event occurs.
Theorem 2 Let $X = S$ and $Y = A(S)$, where $I_\infty^\beta(X; Y) \le k$. Then for any event $O \subseteq \mathcal{X}^n \times \mathcal{Y}$,
$$\Pr[(X, Y) \in O] \le 2^k \cdot \Pr[X \otimes Y \in O] + \beta.$$
In other words, if a bad event (such as the output overfitting the dataset) is unlikely when the dataset is drawn independently of the output, it remains unlikely in the adaptive setting, up to a factor of $2^k$ and an additive $\beta$.
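For a small finite space the theorem can be verified exhaustively. The sketch below (using a made-up joint distribution with pure max information $k = \log_2 1.6$, so $\beta = 0$) checks the inequality over every possible event $O$:

```python
import itertools
import math

# Hypothetical joint distribution of (X, Y); its worst pointwise ratio is
# 0.4 / (0.5 * 0.5) = 1.6, so its pure max information is k = log2(1.6).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}
py = {0: 0.5, 1: 0.5}
k = math.log2(1.6)

points = list(joint)
violations = 0
for r in range(len(points) + 1):
    for O in itertools.combinations(points, r):       # every event O
        p_joint = sum(joint[pt] for pt in O)          # Pr[(X, Y) in O]
        p_product = sum(px[a] * py[b] for a, b in O)  # Pr[X (x) Y in O]
        if p_joint > 2**k * p_product + 1e-12:        # beta = 0 here
            violations += 1
```

Every one of the $2^4$ events satisfies the bound, with equality attained on the diagonal event $\{(0,0), (1,1)\}$.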
We can use this to bound the max information of algorithms that are composed adaptively with one another.
Lemma 3 Let $A : \mathcal{X}^n \to \mathcal{Y}$ be such that $I_\infty^{\beta_1}(A, n) \le k_1$, and let $B : \mathcal{Y} \times \mathcal{X}^n \to \mathcal{Z}$ be such that $I_\infty^{\beta_2}(B(y, \cdot), n) \le k_2$ for every fixed $y \in \mathcal{Y}$. Then if $C$ is the composition $C(S) = B(A(S), S)$, the $(\beta_1 + \beta_2)$-approximate max information of $C$ will be at most $k_1 + k_2$.
Proof: Let $\mathcal{P}$ be a distribution over $\mathcal{X}^n$, let $S \sim \mathcal{P}$, and fix an event $O \subseteq \mathcal{X}^n \times \mathcal{Z}$. We sketch the pure case ($\beta_1 = \beta_2 = 0$); the approximate case follows the same outline with the $\beta$'s accumulating. Conditioning on the output of $A$,
$$\Pr[(S, C(S)) \in O] = \sum_y \Pr[A(S) = y] \cdot \Pr[(S, B(y, S)) \in O \mid A(S) = y],$$
and applying the max information bound of $B(y, \cdot)$ to the conditional distribution of $S$ given $A(S) = y$, we then arrive at
$$\Pr[(S, C(S)) \in O] \le 2^{k_2} \cdot \Pr[(\tilde{S}, B(A(S), S)) \in O],$$
where, given $A(S) = y$, $\tilde{S}$ is an independent draw from the conditional distribution of $S$. Note that $(\tilde{S}, A(S))$ has the same joint distribution as $(S, A(S))$, and that given $A(S)$, the second stage $B(A(S), S)$ is independent of $\tilde{S}$, so the max information bound of $A$ applies:
$$\Pr[(\tilde{S}, B(A(S), S)) \in O] \le 2^{k_1} \cdot \Pr[S' \otimes C(S) \in O],$$
where $S' \sim \mathcal{P}$ is drawn independently of everything else. Combining the two inequalities, by definition of max information,
$$I_\infty(C, n) \le k_1 + k_2. \qquad \square$$
By iteratively applying this lemma to every algorithm in the sequence, we obtain the following theorem for the max information of adaptively composed algorithms:
Theorem 4 If $C_m$ is recursively defined as $C_i(S) = A_i(C_{i-1}(S), S)$, such that each $A_i(y, \cdot)$ has $\beta_i$-approximate max information $k_i$ for every fixed $y$, then $C_m$ has $\left(\sum_{i=1}^m \beta_i\right)$-approximate max information $\left(\sum_{i=1}^m k_i\right)$.
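Theorem 4 is simply budget accounting. A small sketch with made-up numbers (five adaptive steps, each leaking at most 2 bits with failure probability 0.001) shows how the totals feed back into Theorem 2:

```python
# Hypothetical per-step budgets for m adaptively composed algorithms.
m = 5
k_each, beta_each = 2.0, 0.001   # bits and failure probability per step

k_total = m * k_each             # Theorem 4: the max information adds up
beta_total = m * beta_each       # ...and so do the betas

# Plugging the totals into Theorem 2: a bad event with independent-data
# probability p occurs with probability at most 2**k_total * p + beta_total.
p_independent = 1e-6
p_adaptive = 2**k_total * p_independent + beta_total
```

With these numbers the adaptive failure probability is about $0.006$, still dominated by the accumulated $\beta$'s rather than by the $2^{k}$ blow-up.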
3. Bounds on Max Information via Description Length
Theorem 6 Given an algorithm $A : \mathcal{X}^n \to \mathcal{Y}$ whose output has description length $d$ bits, so that $|\mathcal{Y}| \le 2^d$, let $\beta > 0$. Then,
$$I_\infty^\beta(S; A(S)) \le \log_2(2^d/\beta) = d + \log_2(1/\beta).$$
Proof: Let $k = \log_2(2^d/\beta)$. For $\beta > 0$, say that $y \in \mathcal{Y}$ is bad if there exists $O_y \subseteq \mathcal{X}^n$ such that
$$\Pr[S \in O_y \mid A(S) = y] > 2^k \cdot \Pr[S \in O_y].$$
Let $B$ be the set of all bad $y$'s. Given the definition above, we note that for a bad $y$,
$$\Pr[A(S) = y] = \frac{\Pr[S \in O_y,\ A(S) = y]}{\Pr[S \in O_y \mid A(S) = y]} < \frac{\Pr[S \in O_y]}{2^k\,\Pr[S \in O_y]} = 2^{-k} = \frac{\beta}{2^d},$$
and therefore $\Pr[A(S) \in B] < |\mathcal{Y}| \cdot \beta/2^d \le \beta$. Let $O \subseteq \mathcal{X}^n \times \mathcal{Y}$ be any event, and write $O^y = \{x : (x, y) \in O\}$. Then
$$\Pr[(S, A(S)) \in O] = \sum_{y} \Pr[A(S) = y]\,\Pr[S \in O^y \mid A(S) = y] \le \Pr[A(S) \in B] + 2^k \sum_{y \notin B} \Pr[A(S) = y]\,\Pr[S \in O^y] \le \beta + 2^k\,\Pr[S \otimes A(S) \in O],$$
and by the definition of $\beta$-approximate max information, $I_\infty^\beta(S; A(S)) \le k$. $\square$
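Combining Theorem 6 with Theorem 2 gives concrete numbers. The sketch below (all parameter values are made up) bounds the chance that an analyst who adaptively selects one of $2^d$ candidate functions overfits by more than $\tau$, using Hoeffding's inequality for the independent-data probability:

```python
import math

d, beta = 10, 0.01            # 1024 candidate outputs, failure budget 0.01
k = d + math.log2(1 / beta)   # Theorem 6: k = log2(2^d / beta)

n, tau = 10_000, 0.05
# For one fixed [0,1]-valued function, Hoeffding's inequality bounds
# Pr[|empirical mean - true mean| > tau] on n independent samples.
p_independent = 2 * math.exp(-2 * tau**2 * n)

# Theorem 2: the adaptively selected function overfits with probability
# at most 2^k times the independent-data probability, plus beta.
p_adaptive = 2**k * p_independent + beta
```

Even after the $2^k \approx 10^5$ blow-up, the Hoeffding term is negligible, so the total failure probability is essentially the chosen $\beta = 0.01$.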
4. Bounds on Max Information via Differential Privacy
We can show that max information captures purely $\varepsilon$-differentially private algorithms.
Theorem 7 Let $A$ be an $\varepsilon$-differentially private algorithm. Then $I_\infty(S; A(S)) \le \log_2(e) \cdot \varepsilon n$.
Proof: Any two datasets $x, x' \in \mathcal{X}^n$ differ in at most $n$ elements, so by $\varepsilon$-differential privacy, for every output $y$,
$$\Pr[A(S) = y \mid S = x] \le e^{\varepsilon n} \cdot \Pr[A(S) = y \mid S = x'].$$
Averaging the right-hand side over $x'$ drawn from the distribution of $S$ gives $\Pr[A(S) = y \mid S = x] \le e^{\varepsilon n} \cdot \Pr[A(S) = y]$, which, by the definition of divergence, is
$$D_\infty\big((S, A(S)) \,\|\, S \otimes A(S)\big) \le \log_2\!\left(e^{\varepsilon n}\right),$$
and by the definition of max information, we get
$$I_\infty(S; A(S)) \le \varepsilon n \log_2(e). \qquad \square$$
A randomized algorithm $A$ is $(\varepsilon, \delta)$-differentially private if for all pairs of datasets $x, x'$ differing in one element and all output events $E$, $\Pr[A(x) \in E] \le e^{\varepsilon} \Pr[A(x') \in E] + \delta$. Differential privacy composes adaptively: an algorithm obtained by composing $m$ algorithms, each of which is $(\varepsilon, \delta)$-differentially private, is $(m\varepsilon, m\delta)$-differentially private. We can prove approximate max information bounds for such algorithms when the dataset consists of i.i.d. samples, i.e. $S \sim \mathcal{P}^n$. We first say that a function $f : \mathcal{X}^n \to \mathbb{R}$ has low $c$-sensitivity if $|f(x) - f(x')| \le c$ whenever $x$ and $x'$ differ in one element and $c$ is small. McDiarmid's inequality for functions of low $c$-sensitivity states that, for $S$ consisting of $n$ i.i.d. samples and any $t > 0$,
$$\Pr\big[f(S) \ge \mathbb{E}[f(S)] + t\big] \le \exp\!\left(-\frac{2t^2}{n c^2}\right).$$
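As a sanity check on McDiarmid's inequality, the following simulation sketch (parameters are made up) takes $f$ to be the empirical mean of $n$ uniform $[0,1]$ samples, which has $c$-sensitivity $c = 1/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 20_000

# f(S) = empirical mean of n i.i.d. Uniform[0,1] samples; changing one
# sample moves f by at most c = 1/n, and E[f(S)] = 0.5.
means = rng.random((trials, n)).mean(axis=1)
empirical_tail = float(np.mean(means - 0.5 >= t))   # Pr[f >= E[f] + t]

c = 1.0 / n
mcdiarmid_bound = float(np.exp(-2 * t**2 / (n * c**2)))   # = exp(-2 t^2 n)
```

The observed tail frequency falls well below the bound $e^{-2} \approx 0.135$, as the inequality (which is not tight for this $f$) guarantees.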
We can then say the following about the approximate bounds.
Theorem 9 Let $A$ be an $\varepsilon$-differentially private algorithm, and let $S \sim \mathcal{P}^n$ consist of $n$ samples drawn i.i.d. from the distribution $\mathcal{P}$. Then for every $\beta > 0$,
$$I_\infty^\beta(S; A(S)) \le \log_2(e)\left(\frac{\varepsilon^2 n}{2} + \varepsilon\sqrt{\frac{n \ln(2/\beta)}{2}}\right).$$
Proof (sketch): Fix an output $y$ and define $g_y(x) = \ln\frac{\Pr[A(S) = y \mid S = x]}{\Pr[A(S) = y]}$. By Jensen's inequality,
$$\mathbb{E}_{S \sim \mathcal{P}^n}[g_y(S)] \le \ln \mathbb{E}_{S \sim \mathcal{P}^n}\!\left[\frac{\Pr[A(S) = y \mid S]}{\Pr[A(S) = y]}\right] = \ln 1 = 0,$$
and by $\varepsilon$-differential privacy, changing a single element of $x$ changes $\Pr[A(S) = y \mid S = x]$ by a factor of at most $e^{\varepsilon}$, so we have
$$|g_y(x) - g_y(x')| \le \varepsilon \quad \text{whenever } x, x' \text{ differ in one element.}$$
Let $Z = g_y(S)$. By the properties above, $Z$ has $c$-sensitivity $\varepsilon$ and $\mathbb{E}[Z] \le 0$, which, by McDiarmid's inequality, implies that for every $t > 0$,
$$\Pr[g_y(S) \ge t] \le \exp\!\left(-\frac{2t^2}{n\varepsilon^2}\right).$$
Let $k = \log_2(e)\left(\varepsilon^2 n/2 + \varepsilon\sqrt{n \ln(2/\beta)/2}\right)$ (this will help with our bounds later on). Also, let $B = \{(x, y) : g_y(x) > k \ln 2\}$ be the set of bad pairs; we must bound the probability of $B$ under the joint distribution of $(S, A(S))$, not under $S \otimes A(S)$. By Bayes' rule,
$$g_y(x) = \ln\frac{\Pr[S = x \mid A(S) = y]}{\Pr[S = x]},$$
so conditioning on $A(S) = y$ reweights each dataset $x$ by the factor $e^{g_y(x)}$. Partitioning the bad datasets into levels $\{x : g_y(x) \in [k\ln 2 + j\varepsilon,\ k\ln 2 + (j+1)\varepsilon)\}$ for $j = 0, 1, 2, \ldots$, applying the tail bound above to each level, and reweighting, the total is dominated by a geometric series; with $k$ as chosen it sums to
$$\Pr[(S, A(S)) \in B] \le \beta.$$
Therefore, for every event $O$,
$$\Pr[(S, A(S)) \in O] \le \Pr[(S, A(S)) \in B] + 2^k\,\Pr[S \otimes A(S) \in O] \le \beta + 2^k\,\Pr[S \otimes A(S) \in O],$$
and therefore, by the definition of $\beta$-approximate max information, $I_\infty^\beta(S; A(S)) \le k$. $\square$
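To see why Theorem 9 improves on Theorem 7 for i.i.d. data, compare the two bounds for made-up parameters with $\varepsilon$ chosen on the order of $1/\sqrt{n}$:

```python
import math

n, beta = 10_000, 0.01
eps = 1 / math.sqrt(n)   # eps = 0.01

# Theorem 7 (pure bound, worst-case datasets): grows linearly in eps * n.
pure_bound = math.log2(math.e) * eps * n

# Theorem 9 (i.i.d. datasets): dominated by eps^2 * n, which stays O(1)
# when eps ~ 1/sqrt(n).
approx_bound = math.log2(math.e) * (
    eps**2 * n / 2 + eps * math.sqrt(n * math.log(2 / beta) / 2)
)
```

With these numbers the pure bound is about 144 bits, while the approximate bound is only a few bits, small enough for Theorem 2 to give a meaningful guarantee.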
In Thresholdout, the algorithm attempts to answer each query $\phi$ with an estimate $a_\phi$ such that $|a_\phi - \mathcal{P}[\phi]|$ is small with high probability, where $\mathcal{P}[\phi] = \mathbb{E}_{x \sim \mathcal{P}}[\phi(x)]$ is the true expectation of $\phi$ under the data distribution and $\mathcal{E}_S[\phi]$ denotes the empirical mean of $\phi$ over a set $S$. The algorithm is described below.
Algorithm Thresholdout(TrainingSet $S_t$, HoldoutSet $S_h$, threshold $T$, noise rate $\sigma$, budget $B$)
- for each query $\phi$:
- if $|\mathcal{E}_{S_h}[\phi] - \mathcal{E}_{S_t}[\phi]| \le T + \eta$, where $\eta \sim \mathrm{Lap}(2\sigma)$: return $\mathcal{E}_{S_t}[\phi]$
- else: $B \leftarrow B - 1$; return $\mathcal{E}_{S_h}[\phi] + \xi$, where $\xi \sim \mathrm{Lap}(\sigma)$
The algorithm uses the training set as a proxy for the target distribution. For each query, it first checks whether the distance $|\mathcal{E}_{S_h}[\phi] - \mathcal{E}_{S_t}[\phi]|$ is below the noisy threshold $T + \eta$, where $T$ is a fixed threshold and $\eta$ is a Laplace noise variable. If the difference is below the threshold (meaning the training set does not overfit on this query), then the algorithm outputs the training-set estimate $\mathcal{E}_{S_t}[\phi]$. Otherwise, it decrements the budget $B$ and outputs the noisy holdout estimate $\mathcal{E}_{S_h}[\phi] + \xi$. The algorithm continues until the budget runs out.
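A minimal runnable sketch of the algorithm above (the parameter values in the usage example and the exact Laplace scales are assumptions; published variants also refresh the noisy threshold after each over-budget query):

```python
import numpy as np

class Thresholdout:
    """Simplified Thresholdout: answers statistical queries phi: X -> [0, 1]."""

    def __init__(self, train, holdout, threshold, noise_rate, budget, seed=0):
        self.train = np.asarray(train)
        self.holdout = np.asarray(holdout)
        self.T = threshold
        self.sigma = noise_rate
        self.budget = budget
        self.rng = np.random.default_rng(seed)

    def query(self, phi):
        if self.budget < 1:
            return None                               # budget exhausted
        t_avg = float(np.mean(phi(self.train)))       # training estimate
        h_avg = float(np.mean(phi(self.holdout)))     # holdout estimate
        eta = self.rng.laplace(scale=2 * self.sigma)  # noisy threshold
        if abs(h_avg - t_avg) <= self.T + eta:
            return t_avg        # estimates agree: answer from the training set
        self.budget -= 1        # estimates disagree: spend budget and
        xi = self.rng.laplace(scale=self.sigma)
        return h_avg + xi       # answer with a noisy holdout estimate

# Usage: both sets drawn from the same distribution, one simple query.
rng = np.random.default_rng(1)
th = Thresholdout(rng.random(1000), rng.random(1000),
                  threshold=0.04, noise_rate=0.01, budget=100)
answer = th.query(lambda x: x)   # estimate E[x], whose true value is 0.5
```

Because both sets come from the same distribution here, the two empirical means agree and the answer is typically the (noise-free) training estimate, at no budget cost; the budget is only spent on queries where the analyst has overfit the training set.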