Efficiently Learning Ising Models on Arbitrary Graphs [STOC’15]

The paper [1] discussed in this blog post is by Professor Guy Bresler of the Massachusetts Institute of Technology. It was originally presented at the 47th Annual ACM Symposium on Theory of Computing (STOC 2015).

The paper addresses how to reconstruct the graph underlying an Ising model from i.i.d. data samples. Without restrictive assumptions on the model, it was previously unclear how to avoid an exhaustive search over all possible neighborhoods of each node during reconstruction. The paper proposes a simple greedy algorithm that learns any bounded-degree graph in time quadratic in the number of nodes, provided the model parameters are bounded by constants. The intuition behind the algorithm rests on a structural property: any node with a non-empty neighborhood has at least one neighbor with a large influence on it.

1. Ising Model

The Ising model is a classical graphical model built on an undirected graph {G=\left ( V,E \right )} with pairwise interactions. The probability distribution of the variable collection {X} is given by the following equation.

\displaystyle  p\left ( X;\theta \right )= \textup{exp}\left \{\sum_{i\in V}\theta_ix_i+\sum_{\left \{ i,j \right \}\in E}\theta_{i,j}x_ix_j -\Phi\left ( \theta \right ) \right \}  \ \ \ \ \ (1)

Each node {i\in V} represents a binary variable {x_i\in\left\{-1,+1\right\}}, so the overall variable {X\in\left \{ -1,+1 \right \}^p} with {\left | V \right |=p}. The parameters {\theta=\left \{ \theta_{i,j} \right \}_{\left \{ i,j \right \}\in E}\cup \left \{ \theta_{i} \right \}_{i\in V}\in \mathbb{R}^{E\cup V}} consist of edge couplings and node-wise external fields. {\Phi\left ( \theta \right )} is the normalization (log-partition) function that makes the probabilities sum to one. To guarantee that the proposed algorithm works in reasonable time, the paper considers a restricted class of models: {0<\alpha \leq\left | \theta_{i,j} \right |\leq\beta}, {\left | \theta_{i} \right |\leq h}, and {\left | \partial i \right |\leq d}, where {\partial i} denotes the neighborhood of node {i} in the graph and {\left | \partial i \right |} is its degree. These bounds are necessary. For example, if {\left | \theta_{i,j} \right |} can be arbitrarily close to zero, the presence of the corresponding edge has an arbitrarily small effect on the distribution and cannot be detected from a bounded number of samples. {\left | \theta_{i,j} \right |} cannot be too large either, otherwise identifying edges with small parameters would require a very large number of samples.
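
To make Equation (1) concrete, here is a minimal Python sketch (my own illustration, not code from the paper) that enumerates the distribution of a tiny 3-node model by brute force; the particular couplings and fields are arbitrary values chosen to respect the bounds {\alpha\leq\left | \theta_{i,j} \right |\leq\beta} and {\left | \theta_{i} \right |\leq h}.

```python
import itertools
import numpy as np

# Toy 3-node Ising model: edges {0,1} and {0,2}, plus external fields.
# The parameter values are arbitrary illustrative choices.
theta_edge = {(0, 1): 0.8, (0, 2): -0.5}   # theta_{i,j} on the edges
theta_node = [0.1, -0.2, 0.3]              # theta_i external fields
p = 3

def exponent(x):
    """The exponent of Eq. (1), before subtracting Phi(theta)."""
    s = sum(theta_node[i] * x[i] for i in range(p))
    s += sum(t * x[i] * x[j] for (i, j), t in theta_edge.items())
    return s

configs = list(itertools.product([-1, +1], repeat=p))
weights = np.array([np.exp(exponent(x)) for x in configs])
probs = weights / weights.sum()            # dividing by the sum plays the role of -Phi(theta)

for x, pr in zip(configs, probs):
    print(x, round(pr, 4))
```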

According to Equation (1), the probability distribution satisfies the Markov property: node {i} is conditionally independent of any set {S\subseteq V\setminus \left (\left \{ i \right \}\cup\partial i \right )} given its neighborhood {\partial i}. Such a graphical model is known as a Markov random field on an undirected graph, which is very useful for describing high-dimensional probability distributions. It is a rich class of models, and it is particularly tractable when the degree {\left | \partial i \right |} is bounded.

The goal is to learn the graph structure and estimate the parameters given the Ising model assumption and the data samples. The focus of the paper is on learning the graph exactly with probability at least {1-\epsilon}. The data samples {X^{\left ( 1 \right )},X^{\left ( 2 \right )},...,X^{\left ( n \right )}} are i.i.d. Parameter estimation is actually an easier problem once the graph structure is known.

Problem Given the bounds on the node degree and the parameters {\theta}, and {n} i.i.d. data samples, learn the graph {G} exactly with probability at least {1-\epsilon}.

Other important issues include: (1) how much data is needed for exact reconstruction (sample complexity); (2) what the running time is in terms of {p,n} (computational complexity). The paper gives a strong guarantee on the computational complexity of learning Ising models with any bounded degree, and we will see the proof of the underlying lemma later in this blog. The notation {\widetilde{O}\left ( \cdot \right )} hides a factor {\textup{log}\left ( p \right )} and a constant depending (doubly exponentially) on {d}. Exponential dependence on {d} is unavoidable, since the number of samples itself must grow exponentially in {d}, but the doubly exponential dependence may not be optimal.

Theorem 1 For any Ising model satisfying the above degree and parameter bounds, with probability at least {1-\epsilon} the graph can be learned using {n=O\left ( \textup{log}\left ( p \right ) \right )} samples (with a constant depending on {\alpha,\beta,h,d}) in time {\widetilde{O}\left ( p^2 \right )}.

2. Previous Approaches

Approach 1 The most intuitive approach to graph reconstruction uses pairwise statistics. To find the neighbors of {i}, we declare any node {j} with a large mutual information {I\left ( X_i;X_j \right )} with {i} (greater than a threshold {\tau}) to be a neighbor.
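
A minimal sketch of Approach 1 (an illustration, not code from the paper): estimate the pairwise mutual information from {\pm 1} samples with a plug-in estimator and declare {j} a neighbor of {i} whenever it exceeds the threshold {\tau}.

```python
import numpy as np

def pairwise_mi(X, i, j):
    """Plug-in estimate of I(X_i; X_j) from samples X of shape (n, p) with entries +/-1."""
    mi = 0.0
    for a in (-1, +1):
        for b in (-1, +1):
            pab = np.mean((X[:, i] == a) & (X[:, j] == b))
            pa, pb = np.mean(X[:, i] == a), np.mean(X[:, j] == b)
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

def correlation_neighbors(X, i, tau):
    """Approach 1: threshold pairwise mutual information (fails on Challenges 1 and 2 below)."""
    p = X.shape[1]
    return {j for j in range(p) if j != i and pairwise_mi(X, i, j) > tau}
```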

Two major challenges make this intuitive approach unreliable for reconstructing general graphs. Again, there is no assumption on the graph structure except the bounded degree (i.e., bounded neighborhood size).

Challenge 1: Neighboring nodes in the graph can be marginally independent because of indirect path cancellations. As shown in the figure, {X_1} and {X_2} can be marginally independent for a special setting of {\theta_{12},\theta_{13},\theta_{23}}, even though they are neighbors: the indirect path through {X_3} exactly cancels the correlation from the direct edge between {X_1} and {X_2}. In this case, the intuitive approach misses true neighbors of a node.

Challenge 2: Faraway nodes can be highly dependent because multiple paths connect them. In the figure, {X_1} and {X_2} can be highly dependent even though each connecting path introduces only a small correlation, simply because there are many such paths. In this case, the intuitive approach mistakes distant non-neighbors for neighbors.

These two challenges cause serious trouble for algorithms based on local pairwise marginals. In order to reconstruct the underlying graph in general, prior work proposed the following baseline approach. It works for all graph structures, but its running time is too large to be practical.

Approach 2 The exhaustive search approach checks every possible neighborhood. For each node {i}, test all candidate neighborhoods {U} of size at most {d}. During each test, the Markov property is checked against the remaining nodes. The neighborhood {\partial i} is declared found when the Markov property is satisfied with respect to all other nodes.

If {\left | \partial i \right |\leq d}, the algorithm learns the graph with probability at least {1-\epsilon} in time {\widetilde{O}\left ( p^{2d+1} \right )} [2]. Selecting a candidate neighborhood contributes a factor of {\widetilde{O}\left ( p^{d} \right )}, since the candidates range over all subsets of at most {d} nodes out of the {p-1} remaining nodes. In each test, the conditional independence of {i} from the remaining nodes given the candidate {\partial i} must be checked to verify the Markov property, contributing another factor of {\widetilde{O}\left ( p^d \right )}. Since there are {p} nodes in total, the overall time complexity is {\widetilde{O}\left ( p^{2d+1} \right )}.
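
The following sketch is my own reconstruction of this brute-force idea, not the exact procedure of [2] (which differs in the test statistic and thresholds); it shows where the cost comes from: for each node we enumerate all {\widetilde{O}\left ( p^{d} \right )} candidate neighborhoods and, for each, run a conditional-independence check against every remaining node.

```python
import numpy as np
from itertools import combinations

def empirical_cond_mi(X, i, j, U):
    """Plug-in estimate of I(X_i; X_j | X_U) from +/-1 samples X of shape (n, p)."""
    n = X.shape[0]
    keys = [tuple(row) for row in X[:, list(U)]] if U else [()] * n
    total = 0.0
    for key in set(keys):
        idx = np.array([t for t, k in enumerate(keys) if k == key])
        w = len(idx) / n                       # empirical P(X_U = x_U)
        xi, xj = X[idx, i], X[idx, j]
        mi = 0.0
        for a in (-1, +1):
            for b in (-1, +1):
                pab = np.mean((xi == a) & (xj == b))
                pa, pb = np.mean(xi == a), np.mean(xj == b)
                if pab > 0:
                    mi += pab * np.log(pab / (pa * pb))
        total += w * mi
    return total

def exhaustive_neighborhood(X, i, d, tau):
    """Approach 2: return the first candidate set U (|U| <= d) that screens node i
    off from every other node according to the empirical test."""
    p = X.shape[1]
    others = [v for v in range(p) if v != i]
    for size in range(d + 1):                  # O(p^d) candidate sets
        for U in combinations(others, size):
            if all(empirical_cond_mi(X, i, j, U) <= tau
                   for j in others if j not in U):
                return set(U)
    return None
```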

In fact, if we have some prior knowledge about the graph structure, then the structure learning problem becomes much easier. For example, if we assume the underlying graph is a tree, the two challenges above disappear. There is no cancelling path between two neighbors, since every pair of nodes is connected by a single path, so neighbors are always correlated. And the correlation between two faraway nodes decays with the graph distance, so the second challenge does not arise. We then have the following theorem.

Theorem 2 A tree-structured model can be learned with probability at least {1-\epsilon} in quadratic time {\widetilde{O}\left ( p^{2} \right )}.

Approach 3 Chow and Liu [3] proposed a simple algorithm to reconstruct the tree structure: at each step, add the pair with maximum empirical mutual information unless it would form a loop (exactly a maximum-weight spanning tree computation); a sketch appears below.
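
A compact sketch of the Chow-Liu procedure (my own implementation for illustration): compute the empirical pairwise mutual information for all pairs and take a maximum-weight spanning tree, here via Kruskal's algorithm with a union-find.

```python
import numpy as np

def pairwise_mi(X, i, j):
    """Plug-in estimate of I(X_i; X_j) from +/-1 samples X of shape (n, p)."""
    mi = 0.0
    for a in (-1, +1):
        for b in (-1, +1):
            pab = np.mean((X[:, i] == a) & (X[:, j] == b))
            pa, pb = np.mean(X[:, i] == a), np.mean(X[:, j] == b)
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

def chow_liu_tree(X):
    """Return the edge set of a maximum-weight spanning tree under mutual information."""
    p = X.shape[1]
    edges = sorted(((pairwise_mi(X, i, j), i, j)
                    for i in range(p) for j in range(i + 1, p)), reverse=True)
    parent = list(range(p))
    def find(a):                            # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    tree = []
    for w, i, j in edges:                   # Kruskal: add heaviest edge unless it closes a loop
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```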

Another example is general graphs that exhibit correlation decay with respect to graph distance. The correlation between nodes {u} and {v} is measured by {d_C\left ( u,v\right )}.

\displaystyle  d_C\left ( u,v\right )=\sum_{x_u,x_v}\left | P\left ( X_u=x_u,X_v=x_v \right )-P\left ( X_u=x_u\right )P\left ( X_v=x_v \right ) \right | \ \ \ \ \ (2)

The graph distance is denoted {d\left ( u,v\right )}. The correlation decay property requires {d_C\left ( u,v\right )\leq \textup{exp}\left ( -\alpha d\left ( u,v \right ) \right )} for some positive constant {\alpha}. Under this assumption, the major challenges can be avoided. Intuitively, many models satisfy the correlation decay property when the couplings {\theta_{i,j}} are weak. The running time is given in the following theorem [2]. The reason is that we can choose a constant {D} such that every vertex farther than {D} from node {i} has negligible correlation with it, so the search range for the neighborhood is bounded and the computational burden is reduced.
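
A short sketch (illustrative, not from the paper) of estimating the correlation measure {d_C\left ( u,v\right )} of Equation (2) from samples:

```python
import numpy as np

def correlation_distance(X, u, v):
    """Empirical version of Eq. (2): sum over the four +/-1 value pairs of
    |P(X_u=a, X_v=b) - P(X_u=a) P(X_v=b)|, with X of shape (n, p)."""
    d = 0.0
    for a in (-1, +1):
        for b in (-1, +1):
            p_joint = np.mean((X[:, u] == a) & (X[:, v] == b))
            p_u, p_v = np.mean(X[:, u] == a), np.mean(X[:, v] == b)
            d += abs(p_joint - p_u * p_v)
    return d
```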

Theorem 3 Under the correlation decay assumption, with probability at least {1-\epsilon} the graph can be learned using {n=O\left ( \textup{log}\left ( p \right ) \right )} samples in time {\widetilde{O}\left ( p^2 \right )}.

3. Structural Property

In the paper, the author shows that a simple algorithm learns any Ising model in the restricted class and achieves Theorem 1. The design of the algorithm relies on one key structural property of general graphs, described as the influential neighbor lemma.

Lemma 4 Condition on an arbitrary set of nodes {S}. If {S} does not contain all of {\partial i}, then node {i} has at least one neighbor with a large (average conditional) influence on it.

To measure influence, the algorithm uses the notion of the conditional influence of one variable on another. For {u,i\in V}, a node subset {S\subseteq V\setminus \left \{ u,i \right \}}, and a configuration {x_S\in \left \{ -,+ \right \}^S}, define

\displaystyle  \nu_{u|i,x_S}:= P\left ( X_u=+|X_i=+,X_S=x_S\right )-P\left ( X_u=+|X_i=-,X_S=x_S\right ). \ \ \ \ \ (3)

Another measure, called the average conditional influence, is obtained as a weighted average of {\left | \nu_{u|i,x_S} \right |} over the random configuration {X_S}.

\displaystyle  \nu_{u|i,x_S}^{avg}:=E\left (\lambda_i\left ( X_S \right )\left | \nu_{u|i,x_S} \right | \right ) \ \ \ \ \ (4)

The weights are given by the function

\displaystyle  \lambda_i\left ( X_S \right )=2\cdot P\left ( X_i=+|X_S=x_S\right )P\left ( X_i=-|X_S=x_S\right ). \ \ \ \ \ (5)

By the Markov property, the influence of a non-neighbor is zero once the full neighborhood is conditioned on:

\displaystyle  \nu_{u|i,x_S}=0\textup{ }\forall i\in V\setminus\left ( \left \{ u \right \}\cup S \right ) \textup{ if } \partial u\subseteq S. \ \ \ \ \ (6)

The average influence is not an equal-weight average because some configurations are more informative than others. The weight {\lambda_i\left ( x_S \right )} is small when {P\left ( X_i=+|X_S=x_S \right )} or {P\left ( X_i=-|X_S=x_S \right )} is very small, i.e., when {X_i} is almost determined by {x_S}; such configurations carry little information about the influence and should contribute little to the measure. This weighting scheme is what provides the anti-concentration guarantee in expectation.
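
The quantities in Equations (3)-(5) can be estimated from samples with a simple plug-in estimator. The sketch below is my own illustration (the paper's estimator may differ in details): it conditions on each observed configuration {x_S}, weights by its empirical probability, and accumulates {\lambda_i\left ( x_S \right )\left | \nu_{u|i,x_S} \right |}.

```python
import numpy as np

def avg_conditional_influence(X, u, i, S):
    """Plug-in estimate of the average conditional influence of node i on node u,
    conditioning on the nodes in S, from +/-1 samples X of shape (n, p)."""
    n = X.shape[0]
    keys = [tuple(row) for row in X[:, list(S)]] if S else [()] * n
    total = 0.0
    for key in set(keys):
        mask = np.array([k == key for k in keys])
        w = mask.mean()                               # empirical P(X_S = x_S)
        xi, xu = X[mask, i], X[mask, u]
        p_i_plus = np.mean(xi == +1)
        lam = 2.0 * p_i_plus * (1.0 - p_i_plus)       # Eq. (5)
        if lam == 0.0:                                # X_i is fixed given x_S: zero weight
            continue
        nu = (np.mean(xu[xi == +1] == +1)
              - np.mean(xu[xi == -1] == +1))          # Eq. (3)
        total += w * lam * abs(nu)                    # Eq. (4)
    return total
```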

4. A Simple Algorithm

In this section, we introduce the simple algorithm that reconstructs the graph of the Ising model from samples.

{0.} For each node {i}, find the neighborhood {\partial i} as follows.

{1.} Form Pseudo-Neighborhood

{1.1} Start with an empty set {S}

{1.2} Repeat:

{1.2.1} Add the node {u} with the maximum empirical influence {\widehat{\nu}_{u|i,x_S}^{avg}} to {S}

{1.2.2} Stop when {\widehat{\nu}_{u|i,x_S}^{avg}\leq\tau^*} for all {u}

{2.} Prune Non-Neighbors

{2.1} If {\widehat{\nu}_{u|i,x_S}^{avg}\leq\tau^*}, remove {u} from {S}.

Here {\tau^*} is a constant threshold of order {O\left ( \frac{\alpha^2\delta^{O\left (d\right )}}{d\beta} \right )}, with {\delta = \frac{1}{2}\textup{exp}\left ( -2 \left (\beta d+h \right )\right )}. The empirical average conditional influence {\widehat{\nu}_{u|i,x_S}^{avg}} is computed from the data samples; see the sketch below.
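
Below is a hedged Python sketch of the greedy algorithm for a single node. The function name and the handling of the threshold are my own choices, it reuses avg_conditional_influence from the earlier sketch, and it treats tau_star as a given input (in the paper {\tau^*} is a specific function of {\alpha,\beta,h,d}).

```python
def learn_neighborhood(X, i, tau_star):
    """Greedy pseudo-neighborhood construction followed by pruning, for a single node i.
    Assumes avg_conditional_influence(X, u, i, S) from the sketch above is in scope."""
    p = X.shape[1]
    S = []
    # Step 1: grow the pseudo-neighborhood greedily.
    while True:
        candidates = [u for u in range(p) if u != i and u not in S]
        if not candidates:
            break
        scores = {u: avg_conditional_influence(X, u, i, S) for u in candidates}
        best = max(scores, key=scores.get)
        if scores[best] <= tau_star:       # no remaining node has large influence
            break
        S.append(best)
    # Step 2: prune non-neighbors, conditioning on the rest of the pseudo-neighborhood.
    pruned = []
    for u in list(S):
        rest = [v for v in S if v != u]
        if avg_conditional_influence(X, u, i, rest) > tau_star:
            pruned.append(u)
    return set(pruned)

# Running learn_neighborhood for every node i recovers the whole graph; the paper
# combines the per-node neighborhoods and shows the total running time is quadratic in p.
```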

Now we argue the correctness of the algorithm. While building the pseudo-neighborhood, true neighbors keep being added to {S}: by the influential neighbor lemma, as long as {S} does not yet cover the true neighborhood {\partial i} (note that {S} and {\partial i} may overlap), some node still has influence above the threshold, so the loop continues. Eventually the pseudo-neighborhood covers the true neighborhood, possibly together with some non-neighbors. In the pruning step, a node {u} is removed if {\widehat{\nu}_{u|i,x_S}^{avg}} falls below the threshold {\tau^*}. By the Markov property, {\nu_{u|i,x_S}^{avg}=0} for a non-neighbor {u} once {\partial i\subseteq S}, and {\nu_{u|i,x_S}^{avg}} and {\widehat{\nu}_{u|i,x_S}^{avg}} are close given enough data samples. Thus {\widehat{\nu}_{u|i,x_S}^{avg}} is small (below {\tau^*}) for every non-neighbor {u}, which is therefore eventually removed from {S}. In the end, only the true neighbors {\partial i} remain in {S}.

The size of the pseudo-neighborhood {S} is bounded by {\frac{1}{\tau^*}}. Since {X_i} is a binary variable and, roughly speaking, each added node contributes at least {\tau^*} of mutual information, the definition of entropy and the chain rule for mutual information give

\displaystyle  1\geq H\left ( X_i \right )\geq I\left ( X_i;X_S \right )=\sum_{k=1}^{\left | S \right |}I\left ( X_i;X_{j_k}|X_{j_1},...,X_{j_{k-1}} \right )\geq\left | S \right |\tau^*. \ \ \ \ \ (7)

Therefore {\left | S \right |\leq\frac{1}{\tau^*}}.

5. Proof of Influential Neighbor Lemma

The influential neighbor lemma is restated as follows.

Lemma 5 Condition on an arbitrary set of nodes {S}. If {S} does not contain all of {\partial i}, then node {i} has at least one neighbor with a large (average conditional) influence on it.

The proof is sketched here for the case when {S} is the empty set; the same strategy applies when {S} is not empty.

We are going to prove the following inequality.

\displaystyle  \sum_{i\in\partial u}\theta_{ui}\lambda_i\nu_{u|i}\geq 2\tau^*\left \| \theta_{u\partial u} \right \|_1 \ \ \ \ \ (8)

Here {\theta_{u\partial u}} is the vector of pairwise parameters between {u} and its neighbors {\partial u}. Once this inequality is proved, since {\sum_{i\in\partial u}\left | \theta_{ui} \right |=\left \| \theta_{u\partial u} \right \|_1}, it follows immediately that at least one neighbor {i} satisfies {\lambda_i\left | \nu_{u|i} \right |=\nu_{u|i}^{avg}\geq2\tau^*}. Now we expand the conditional influence.

\displaystyle  \nu_{u|i} = P\left ( u+|i+ \right )-P\left ( u+|i- \right ) \ \ \ \ \ (9)

\displaystyle  = \sum_{x_{\partial u\setminus i}}\left [ P\left ( u+|i+,x_{\partial u\setminus i} \right )P\left ( x_{\partial u\setminus i}|i+ \right )-P\left ( u+|i-,x_{\partial u\setminus i} \right ) P\left ( x_{\partial u\setminus i}|i- \right )\right ] \ \ \ \ \ (10)

\displaystyle  = \sum_{x_{\partial u\setminus i}}\left [ P\left ( u+|i+,x_{\partial u\setminus i} \right )\frac{P\left ( x_{\partial u\setminus i},i+ \right )}{P\left ( i+ \right )}-P\left ( u+|i-,x_{\partial u\setminus i} \right ) \frac{P\left ( x_{\partial u\setminus i},i- \right )}{P\left ( i- \right )}\right ] \ \ \ \ \ (11)

\displaystyle  =\sum_{x_{\partial u}}\left [ P\left ( u+|x_{\partial u} \right )\frac{x_iP\left ( x_{\partial u} \right )}{P\left ( x_i \right )} \right ] \ \ \ \ \ (12)

\displaystyle  =E_{\partial u}\left [ P\left ( u+|x_{\partial u} \right )\frac{x_i}{P\left ( x_i \right )} \right ] \ \ \ \ \ (13)

We introduce the following quantities.

\displaystyle  s_i=\frac{1}{2}\left ( \frac{1}{P\left ( i+ \right )}-\frac{1}{P\left ( i- \right )} \right ) \ \ \ \ \ (14)

\displaystyle  t_i=\frac{1}{2}\left ( \frac{1}{P\left ( i+ \right )}+\frac{1}{P\left ( i- \right )} \right )=\frac{1}{\lambda_i} \ \ \ \ \ (15)

Since {\frac{x_i}{P\left ( x_i \right )}\in\left \{ \frac{1}{P\left ( i+ \right )},-\frac{1}{P\left ( i- \right )} \right \}}, we have the following identities.

\displaystyle  \frac{s_i}{t_i}=-Ex_i \ \ \ \ \ (16)

\displaystyle  \frac{1}{t_i}\left ( \frac{x_i}{P\left ( x_i \right )}-s_i \right )=x_i \ \ \ \ \ (17)

\displaystyle  \lambda_i\frac{x_i}{P\left ( x_i \right )}=x_i-Ex_i \ \ \ \ \ (18)
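
As a quick numerical sanity check of identity (18) (my own verification, not part of the paper), one can plug in an arbitrary marginal for the {\pm 1} variable {X_i}:

```python
import numpy as np

# Pick an arbitrary marginal for the +/-1 variable X_i.
p_plus = 0.7
p_minus = 1.0 - p_plus
lam = 2.0 * p_plus * p_minus               # lambda_i, Eq. (5) with empty S
mean_xi = (+1) * p_plus + (-1) * p_minus   # E[x_i]

for x_i, p_x in [(+1, p_plus), (-1, p_minus)]:
    lhs = lam * x_i / p_x                  # lambda_i * x_i / P(x_i)
    rhs = x_i - mean_xi                    # x_i - E[x_i]
    assert np.isclose(lhs, rhs)
print("identity (18) holds for both values of x_i")
```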

Therefore,

\displaystyle  \lambda_i\nu_{u|i}=E_{\partial u}\left [ P\left ( u+|x_{\partial u} \right )\left ( x_i-Ex_i \right ) \right ] \ \ \ \ \ (19)

\displaystyle  =\frac{1}{2}E_{\partial u}\left [ \left (2P\left ( u+|x_{\partial u} \right )-1 \right )\left ( x_i-Ex_i \right ) \right ] \ \ \ \ \ (20)

\displaystyle  =\frac{1}{2}E_{\partial u}\left [ \textup{tanh}\left (Z \right )\left ( x_i-Ex_i \right ) \right ], \ \ \ \ \ (21)

where {Z=\sum_{j\in\partial u}\theta_{uj}x_j+\theta_u}; indeed, by the definition of the model, {2P\left ( u+|x_{\partial u} \right )-1=\textup{tanh}\left ( Z \right )}. Then,

\displaystyle  \sum_{i\in\partial u}\theta_{ui}\lambda_i\nu_{u|i}=\frac{1}{2}E_{\partial u}\left [ \textup{tanh}\left (Z \right )\sum_{i\in\partial u}\theta_{ui}\left ( x_i-Ex_i \right ) \right ] \ \ \ \ \ (22)

\displaystyle  =\frac{1}{2}E_{\partial u}\left [ \textup{tanh}\left (Z \right )\left ( Z-\mu \right ) \right ], \ \ \ \ \ (23)

where {\mu=EZ}.

Now we want anti-concentration for {Z}, which follows from anti-concentration for weighted sums of i.i.d. uniform {\pm 1} random variables.

Lemma 6 (Erdős [4]) Let {w_1,w_2,...,w_r} be real numbers with {\left | w_i \right |\geq\alpha} for all {i}, and let {I\subseteq\mathbb{R}} be any interval of length {2\alpha}. If {\varepsilon=\left ( \varepsilon_1,\varepsilon_2,...,\varepsilon_r \right )} is uniformly distributed on {\left \{ -1,+1 \right \}^r}, then

{P\left ( w\cdot \varepsilon \in I \right )\leq\frac{1}{2^r}\binom{r}{\left \lfloor r/2 \right \rfloor}\leq \frac{1}{2}}
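
A small Monte Carlo illustration of Lemma 6 (my own check, with arbitrary weights and an arbitrary interval): the empirical probability that {w\cdot \varepsilon} falls in an interval of length {2\alpha} stays below the central binomial bound.

```python
import numpy as np
from math import comb, floor

rng = np.random.default_rng(0)
alpha, r, trials = 1.0, 7, 200_000
# Weights with |w_i| >= alpha and random signs.
w = rng.uniform(alpha, 2 * alpha, size=r) * rng.choice([-1, 1], size=r)

eps = rng.choice([-1, 1], size=(trials, r))        # uniform +/-1 sign vectors
sums = eps @ w

t0 = 0.5                                           # an arbitrary interval (t0 - alpha, t0 + alpha]
hit = np.mean((sums > t0 - alpha) & (sums <= t0 + alpha))
bound = comb(r, floor(r / 2)) / 2 ** r
print(f"empirical P(w.eps in I) = {hit:.4f}  <=  bound {bound:.4f}")
```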

The lemma says that a weighted sum of uniform {\pm 1} variables cannot place much probability mass in any interval of length {2\alpha}. In our setting, each neighbor variable {X_i} retains a constant amount of randomness conditioned on the others (its conditional probabilities are bounded away from {0} and {1} by {\delta}), so a comparable anti-concentration statement applies to {Z} around its mean, which yields a lower bound on {E\left [ \textup{tanh}\left ( Z \right )\left ( Z-\mu \right ) \right ]}. With further manipulation of the {\textup{tanh}} term, one proves the following.

\displaystyle  \sum_{i\in\partial u}\theta_{ui}\lambda_i\nu_{u|i}\geq 2\tau^*d\beta\geq 2\tau^*\left \| \theta_{u\partial u} \right \|_1 \ \ \ \ \ (24)

This completes the proof sketch. (Please check the paper for the full technical details.)

6. What about Higher Order Interactions?

In this section, we consider models with higher order interactions (e.g., defined on a hypergraph). For example,

\displaystyle  P\left ( x \right )=\frac{1}{Z}\textup{exp}\left ( X_1X_2X_3 \right ). \ \ \ \ \ (25)

For a specific joint behavior of {X_1,X_2,X_3} (for instance, when {X_3} behaves like a noisy parity of {X_1} and {X_2}), the variables {X_1,X_2,X_3} can be pairwise independent. The single-node influential neighbor lemma no longer holds in this case.

However, for some pair of nodes {u,v} in the interaction, the conditional mutual information satisfies {I\left ( X_i;X_u,X_v|X_S \right )\geq\Delta} (and similarly for an analogous average conditional influence) for a positive constant {\Delta}. So we can still apply the same algorithm for higher order interactions by searching over pairs of nodes instead of single nodes; the running time becomes {\widetilde{O}\left ( p^3 \right )}. Analogously, we have the following theorem.

Theorem 7 With {r}th order interactions, the graph can be learned with probability at least {1-\epsilon} using {n=f\left ( d \right )\textup{log}\left ( p \right )} samples in time {\widetilde{O}\left ( p^r \right )}.

7. Discussion

The paper proposed a simple algorithm that learns Ising models efficiently without assuming correlation decay, thereby avoiding exhaustive search. A new information-theoretic structural property (the influential neighbor lemma) was introduced that requires no prior knowledge of the graph structure.

There are also some open questions left:

(1). Can the doubly exponential dependence on {d} in the running time be avoided? A singly exponential dependence may be possible.

(2). How to extend the results to other graphical models, e.g. Potts model?

(3). What happens if we relax the restrictions placed on the model, e.g. on {d,\alpha,\beta,h}?

References

[1] Bresler, G., 2015. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC 2015), pp. 771-782. ACM. https://arxiv.org/abs/1411.6156

[2] Bresler, G., Mossel, E. and Sly, A., 2008. Reconstruction of Markov random fields from samples: Some observations and algorithms. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques (pp. 343-356). Springer Berlin Heidelberg. https://arxiv.org/abs/0712.1402

[3] Chow, C. and Liu, C., 1968. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3), pp.462-467.

[4] Erdős, P., 1945. On a lemma of Littlewood and Offord. Bulletin of the American Mathematical Society, 51(12), pp.898-902.
