A Practical and Provable Algorithm for Subspace Clustering

1. Introduction

Traditional distance-based clustering algorithms such as K-means perform poorly on high-dimensional data because of the curse of dimensionality [1]. A common approach to clustering such data is to assume that even though its ambient dimensionality is high, its intrinsic dimensionality is much lower. As an example, consider a dataset of face images of different people. Although each face image consists of many pixels, it can be described by far fewer parameters corresponding to the image-capturing process. Such low-dimensional structures are modeled as subspaces. Typically, real-world data comes from a union of subspaces, and the problem of subspace clustering is to segment the data into multiple groups and find the low-dimensional subspace that fits each group of points. In this post, we take a look at some of the subspace clustering algorithms that have become popular in recent times. We consider one of them, Greedy Subspace Clustering [2], in detail, and show that under certain conditions the algorithm guarantees exact clustering.

2. Problem Formulation

We are given {X={x_1, x_2,...,x_N}}, a set of {N} data points in {\mathbb{R}^D}. We assume that the points lie on or near a union of {L} subspaces {S= \bigcup_{i=1}^{L}S_i}. Each subspace {S_i} is of dimension {d_i} where {d_i < D}. Moreover, each subspace is described by a set of vectors that span the subspace. The goal is to recover the subspaces {S} and segment the points according to these subspaces.

3. Algorithms

A common approach to subspace clustering can be described in two steps:

  1. For each data point, find the points that are likely to lie on the same subspace as the point in question.
  2. Given the above segmentation of points, find the subspaces that contain these groups of points.

We discuss here very briefly some of the algorithms that follow the above approach:

  1. K-subspace: This is an iterative algorithm in which, given an initial segmentation, we use classical PCA to determine the subspace for each group. Then, given the subspaces, we reassign each point to its nearest subspace. The algorithm alternates between these two steps, much like the K-means algorithm. The number of subspaces and their dimensions need to be known beforehand.
  2. Mixture of Probabilistic PCA: This assumes that each data point is generated from a probabilistic PCA (PPCA) model. Each subspace is characterized by its PPCA model parameters, along with a mixing probability that determines how likely a point is to come from that subspace. The algorithm then uses the Expectation-Maximization (EM) algorithm to alternate between model estimation and data segmentation.
  3. Locally Linear Manifold Clustering: The algorithm expresses each data point as an affine combination of its neighboring points. The neighbors are determined based on Euclidean distance. However, a point and its nearest neighbors may not always lie in the same subspace.
  4. Sparse Subspace Clustering: Similar to the above, the algorithm expresses each data point as a combination of other points. However, the neighbors need not be the nearest points. Instead, each data point is written as a sparse affine or linear combination of all the other points. It can be shown that, under certain conditions, such a sparse combination is achieved if and only if the selected points lie on the same subspace. The algorithm constructs an affinity matrix from the sparse representation and then uses spectral clustering to determine the clusters (a sketch of this last step is given after this list).
  5. Low Rank Representation: This is similar to sparse subspace clustering, with the difference that instead of a sparse representation we seek a low-rank representation. However, since minimizing the rank is NP-hard, the LRR algorithm finds the representation of the data with the lowest nuclear norm, a convex surrogate for rank.
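To make the last step of SSC/LRR concrete, here is a minimal sketch of turning a self-expression coefficient matrix into cluster labels via spectral clustering. The coefficient matrix {C} below is a random placeholder standing in for an actual SSC or LRR solution, and the variable names are our own:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

N, L = 200, 3                  # number of points and clusters (hypothetical values)
C = np.random.randn(N, N)      # placeholder for an SSC/LRR coefficient matrix
np.fill_diagonal(C, 0.0)       # a point is not used to represent itself

# Symmetrize the representation into an affinity matrix, then spectrally cluster it.
A = np.abs(C) + np.abs(C).T
labels = SpectralClustering(n_clusters=L, affinity="precomputed").fit_predict(A)
```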

4. Greedy Subspace Clustering

In this post, we discuss a practical algorithm for subspace clustering, called Greedy Subspace Clustering, and establish guarantees for it. Greedy Subspace Clustering combines two algorithms, corresponding to the two steps listed in Section 3: to determine the neighbors of each point, we use the Nearest Subspace Neighbor (NSN) algorithm, and to recover the subspaces from the resulting segmentation, we use the Greedy Subspace Recovery (GSR) algorithm.

4.1. Nearest Subspace Neighbor

The NSN algorithm provides a greedy approach for finding the points that are most likely to lie on the same subspace. The idea is to maintain a collection of the neighbor points found so far. At any time step, the point most likely to be a neighbor is the one closest to the subspace spanned by the current neighbor points. The algorithm is described below; a short Python sketch follows the listing.

Input: {X={x_1, x_2,...,x_N}}, a set of {N} data points in {\mathbb{R}^D}; Number of required neighbors {K}; Maximum subspace dimension {k_{max}}

Output: A neighborhood matrix {W \in \{0,1\}^{N \times N}}

  1. Normalize the magnitudes of each data point.

    \displaystyle x_i= \frac{x_i}{||x_i||_2}

    for all { i \in [N] }

  2. For each data point {i}:
    1. Initialize a set of neighbor points {\mathcal{I}_i}, initially containing the point itself.

      \displaystyle {\mathcal{I}_i=\{i\}}

    2. Until the set of neighbor points contains the required number of neighbors {K}, iteratively add the point that is closest to the current subspace. To do this, we compute the projection of every point that is not yet a neighbor onto the subspace spanned by the current neighbors, and add the point with the largest projection to the set of current neighbors. The projection of a point {x} onto a subspace {\mathcal{U}} is calculated as:

      \displaystyle Proj_{\mathcal{U}}x=arg \ min_{u \in \mathcal{U}}||x-u||_2

      for {k=1:K} do:

      if {k \leq k_{max}} then:

      {\mathcal{U}=span\{x_j: j \in \mathcal{I}_i\}} (subspace so far)

      end if

      {j^*=arg \ max_{j \in [N]\setminus \mathcal{I}_i} ||Proj_{\mathcal{U}}x_j||_2}

      {\mathcal{I}_i=\mathcal{I}_i \cup \{j^*\}}

      end for 

    3. Lastly, set the entries {W_{ij}} of the neighborhood matrix {W} for point {i} as:

      \displaystyle W_{ij}= 1 \ if \ j \in \mathcal{I}_i \ or \ x_j \in \mathcal{U}, \forall j \in [N]
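Below is a minimal numpy sketch of NSN following the listing above. The function name, the use of a thin QR factorization for the span, and the numerical tolerance used to test {x_j \in \mathcal{U}} are our own choices, not part of the original algorithm description:

```python
import numpy as np

def nsn(X, K, k_max):
    """Nearest Subspace Neighbor (sketch).

    X     : (D, N) array with one data point per column.
    K     : number of neighbors to collect for each point.
    k_max : maximum dimension of the greedily grown subspace.
    Returns an (N, N) 0/1 neighborhood matrix W.
    """
    D, N = X.shape
    X = X / np.linalg.norm(X, axis=0, keepdims=True)           # step 1: normalize
    W = np.zeros((N, N), dtype=int)

    for i in range(N):
        neighbors = [i]                                        # I_i = {i}
        U = None
        for k in range(1, K + 1):
            if k <= k_max:
                U, _ = np.linalg.qr(X[:, neighbors])           # basis of span{x_j : j in I_i}
            proj = np.linalg.norm(U.T @ X, axis=0)             # ||Proj_U x_j||_2 for all j
            proj[neighbors] = -np.inf                          # exclude collected points
            neighbors.append(int(np.argmax(proj)))             # j* with largest projection
        W[i, neighbors] = 1                                    # j in I_i
        W[i, np.linalg.norm(U.T @ X, axis=0) >= 1 - 1e-10] = 1 # x_j (numerically) in U
    return W
```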

Although a naive implementation of the algorithm is costly, with a few tricks its running time can be brought down to {O(KDN^2)}, which is much lower than that of convex-optimization-based methods such as Sparse Subspace Clustering or Low Rank Representation.

4.2. Greedy Subspace Recovery

The GSR algorithm is used to determine whether the neighbors of a point returned by NSN are indeed correct, i.e., whether they span a true subspace. The intuition is that if the neighbors span a true subspace, then many other points should lie on that span. To that end, the algorithm constructs a candidate subspace for each point using the neighborhood matrix returned by NSN. At each step, it picks the current best subspace estimate, namely the one with the largest number of points on or near it. This is repeated until {L} subspaces have been picked. The algorithm is described below; a short Python sketch follows the listing.

Input: {X={x_1, x_2,...,x_N}}, a set of {N} data points in {\mathbb{R}^D}; A neighborhood matrix {W \in \{0,1\}^{N \times N}}; Error bound {\epsilon}

Output: Estimated subspaces {\hat{S}_1,...,\hat{S}_L}; estimated labels {\hat{w}_1,...,\hat{w}_N}

  1. Normalize the magnitudes of each data point.

    \displaystyle x_i= \frac{x_i}{||x_i||_2}

    for all { i \in [N] }

  2. Construct subspace for each point using their neighbors: 

    \displaystyle \mathcal{W}_i=Top.d\{x_j: W_{ij}=1\}, \forall i \in [N]

    {Top.d} denotes the best {d}-dimensional subspace fit to the set of neighbor vectors; it is spanned by the first {d} left singular vectors of the matrix whose columns are the vectors in the set. This forms our estimate of the subspace spanned by a point and its neighbors.

  3. We start with a set {\mathcal{I}} which initially contains all the points.

    \displaystyle \mathcal{I}=[N]

  4. In the subsequent steps, we iterate over the points in the set {\mathcal{I}} and pick the best subspace as the one with the maximum number of points on or near it. We count a point {x} as being on or near the subspace {\mathcal{W}} if {||Proj_{\mathcal{W}}x||_2 \geq 1- \epsilon}. We then remove from {\mathcal{I}} the points on or near the picked subspace, and repeat until {L} subspaces have been picked.

    while {\mathcal{I}\neq \emptyset} do:

    {i^*=arg \ max_{i \in \mathcal{I}} \sum_{j=1}^{N}\mathbb{I}\{||Proj_{\mathcal{W}_i}x_j||_2 \geq 1- \epsilon\}}

    {\hat{S}_l = \mathcal{W}_{i^*}} (the {l}-th recovered subspace)

    {\mathcal{I}=\mathcal{I} \setminus \{j:||Proj_{\mathcal{W}_{i^*}}x_j||_2 \geq 1- \epsilon\}}

    end while 

  5. Lastly, we label the points according to their subspace, i.e. a point gets the label of the subspace it is closest to:

    \displaystyle \hat{w}_i=arg \ max_{l \in [L]} ||Proj_{\hat{S}_l}x_i||_2, \forall i \in [N]
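Below is a matching numpy sketch of GSR, again with our own function name and conventions; the {Top.d} step is implemented with a thin SVD, and the loop stops once {L} subspaces have been picked:

```python
import numpy as np

def gsr(X, W, L, d, eps):
    """Greedy Subspace Recovery (sketch).

    X   : (D, N) array with one data point per column.
    W   : (N, N) 0/1 neighborhood matrix from NSN.
    L   : number of subspaces to recover.
    d   : subspace dimension.
    eps : error bound for counting a point as lying on a subspace.
    Returns a list of L orthonormal bases and an (N,) array of labels.
    """
    D, N = X.shape
    X = X / np.linalg.norm(X, axis=0, keepdims=True)        # step 1: normalize

    # Step 2: Top.d -- first d left singular vectors of each neighborhood.
    bases = []
    for i in range(N):
        U, _, _ = np.linalg.svd(X[:, W[i] == 1], full_matrices=False)
        bases.append(U[:, :d])

    # Steps 3-4: greedily pick the subspace covering the most remaining points.
    remaining, subspaces = set(range(N)), []
    while remaining and len(subspaces) < L:
        best_i, best_cover = None, None
        for i in remaining:
            cover = np.linalg.norm(bases[i].T @ X, axis=0) >= 1 - eps
            if best_cover is None or cover.sum() > best_cover.sum():
                best_i, best_cover = i, cover
        subspaces.append(bases[best_i])
        remaining -= set(np.flatnonzero(best_cover))

    # Step 5: label each point by the estimated subspace it is closest to.
    scores = np.stack([np.linalg.norm(S.T @ X, axis=0) for S in subspaces])
    labels = np.argmax(scores, axis=0)
    return subspaces, labels
```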

The algorithm guarantees exact recovery of the subspaces provided that, for each subspace, the neighborhood matrix contains at least one point all of whose neighbors lie on that same subspace.
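As a usage example, assuming the nsn and gsr sketches above, one can generate synthetic union-of-subspaces data in the spirit of the fully random model discussed in the next section and run the two stages back to back (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, L, n = 50, 5, 3, 100            # ambient dim, subspace dim, #subspaces, points per subspace

X, true_labels = [], []
for l in range(L):
    basis, _ = np.linalg.qr(rng.standard_normal((D, d)))    # random orthonormal basis of S_l
    pts = basis @ rng.standard_normal((d, n))                # random points on S_l
    X.append(pts / np.linalg.norm(pts, axis=0))              # normalize to unit norm
    true_labels += [l] * n
X = np.hstack(X)

W = nsn(X, K=d, k_max=d)              # neighborhoods (K = k_max = d, as in the theorem below)
_, labels = gsr(X, W, L=L, d=d, eps=1e-3)
```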

5. Theoretical Guarantees

For simplicity, the authors assume that all subspaces have the same dimension {d} and have the same number of points {n} on or near them. The authors establish theoretical guarantees for two standard noiseless models: (i) the fully random model and (ii) the semi-random model. We consider here the fully random model; the analysis for the other model is similar.

Guarantee under Fully Random Model

In the fully random model, the subspaces are drawn i.i.d. uniformly at random, and the points on each subspace are also generated i.i.d. at random. We have the following theorem:

Theorem 1 Let there be {L} subspaces, each of dimension {d}, generated under the fully random model, with {n} points on each subspace, and let {n} be polynomial in {d}. If, for some constants {C_1,C_2>0} and {\delta \in (0,1)}:

\displaystyle \frac{n}{d}>C_1(\log \frac{ne}{d \delta})^2, \ \frac{d}{D}<\frac{C_2 \log n}{\log (ndL^{\delta -1})} \ \ \ \ \ (1)

then NSN+GSR (with {K=k_{max}=d}) clusters the points exactly with probability at least {1-\frac{3L \delta}{1-\delta}}.

Exact recovery can only take place if the subspaces are “easy” to cluster, and the two conditions in the theorem formalize the difficulty of the clustering problem. The first condition requires that the number of points on each subspace be large enough that they are identifiable as a cluster; this is what NSN needs in order to find correct neighbors. The second condition says that the dimension of the subspaces should be significantly smaller than the ambient dimension. Intuitively, if the subspaces have high dimension, it becomes difficult to distinguish between them. This condition, however, depends on {n}: as {n} increases, the points become denser on each subspace, and the subspaces become easier to identify.

Both conditions are in accordance with our intuition. The authors note that these conditions are weaker than those required by algorithms such as SSC. We give a proof sketch of the theorem before proving it formally.

Proof Sketch: First we show that GSR clusters the points exactly with probability one if NSN finds correct neighbors. This follows from a statistical property of the fully random model. Next we show that NSN does indeed find correct neighbors. Since the analysis is the same for every subspace, we only consider the subspace {S_1}. Furthermore, we focus on a single point {x_1} on {S_1} (labeled {w_1=1}) and prove that NSN finds correct neighbors for this point under the conditions in (1). NSN collects a correct neighbor at step {k} if the closest uncollected point on {S_1} is closer to the subspace {\mathcal{U}} constructed by NSN at that step than any point from the other subspaces. Mathematically:

\displaystyle \max_{j:w_j=1, j \notin \mathcal{I}_1}||Proj_{\mathcal{U}}x_j||_2 > \max_{j:w_j\neq 1, j \notin \mathcal{I}_1}||Proj_{\mathcal{U}}x_j||_2 \ \ \ \ \ (2)

for every {k=1,...,K}. Since (2) is difficult to analyze directly because its terms are interdependent, we make use of an Oracle algorithm (similar to NSN but easier to analyze). We then use lemmas on the order statistics of random projections to lower bound the projection lengths of the correct points and upper bound the projection lengths of the incorrect points. Finally, we show that, under the conditions of the theorem, the largest projection of an incorrect point is smaller than the lower bound on the projection of the best correct point.

Proof: The proof of the theorem consists of the following steps:

Step One

The first step is to show that GSR clusters the points exactly if NSN computes correct neighbors.

  1. The fully random model has the property that the probability of a point lying on any fixed {d}-dimensional subspace of {\mathbb{R}^D}, other than the true subspaces, is 0. Hence, among all {d}-dimensional subspaces of {\mathbb{R}^D}, the true subspaces are almost surely the ones containing the most points on or near them.
  2. It follows that if, for a given subspace, some point has {d-1} correct neighbors on that subspace, then the subspace they span is almost surely the true subspace and will be picked by GSR.
  3. All that remains is for NSN to find at least {d-1} correct neighbors for at least one point on each subspace; GSR will then recover the true subspaces and cluster the points exactly.

Step Two

Next, we need to prove that NSN does indeed find correct neighbors; essentially, we need to show that the conditions in (1) lead to (2) with high probability. Following the argument in the proof sketch, we construct an Oracle algorithm that is easier to analyze. The Oracle algorithm knows the true label of each data point and returns either SUCCESS or FAILURE depending on the outcome of NSN; in other words, it returns FAILURE if and only if NSN picks an incorrect neighbor for {x_1}.

1. Oracle Algorithm for {x_1}

Input: {X={x_1, x_2,...,x_N}}, a set of {N} data points in {\mathbb{R}^D}; Number of required neighbors {K=d}; Maximum subspace dimension {k_{max}=d}

Output: SUCCESS or FAILURE of NSN

  1. Initialize a set of neighbor points {\mathcal{I}_1}, initially containing the point itself.

    \displaystyle {\mathcal{I}_1^{(1)}=\{1\}}

  2. Until the set of neighbor points contains the required number of neighbors {d}, the algorithm determines the point {j_k^*} that is closest to the current subspace {\mathcal{V}_k} from among the points on subspace {S_1}. It also determines the point closest to {\mathcal{V}_k} from among the points not on {S_1}. If the latter is at least as close to {\mathcal{V}_k} as the former, the algorithm returns FAILURE; otherwise it adds {j_k^*} to {\mathcal{I}_1} and continues, returning SUCCESS once the loop completes.

for {k=1:d} do:

{\mathcal{V}_k=span\{x_j: j \in \mathcal{I}_1^{(k)}\}} (subspace so far)

{j^*_k=arg \max_{j \in [N] :w_j=1, j \notin \mathcal{I}_1^{(k)}}||Proj_{\mathcal{V}_k}x_j||_2}

if  {\max_{j \in [N] :w_j=1, j \notin \mathcal{I}_1^{(k)}}||Proj_{\mathcal{V}_k}x_j||_2 \leq \max_{j \in [N] :w_j \neq 1}||Proj_{\mathcal{V}_k}x_j||_2}  then:

Return FAILURE

end if

{\mathcal{I}_1^{(k+1)}=\mathcal{I}_1^{(k)} \cup \{j^*_k\}}

end for

Return SUCCESS

We note that the Oracle returns FAILURE if and only if NSN fails. Hence, it suffices to analyze the success condition of the Oracle algorithm, which is:

\displaystyle ||Proj_{\mathcal{V}_k}x_{j_k^*}||_2 > \max_{j \in [N] :w_j \neq 1}||Proj_{\mathcal{V}_k}x_j||_2, \ \forall k=1,...,d
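For intuition only, here is a small Python sketch of this Oracle check for a single point, assuming the true labels {w} are known; the function name and all variable names are our own:

```python
import numpy as np

def oracle_nsn(X, w, d, i0=0):
    """Returns True (SUCCESS) iff at every step the best uncollected point on the
    true subspace of x_{i0} projects onto the current subspace more strongly than
    every point from the other subspaces."""
    X = X / np.linalg.norm(X, axis=0, keepdims=True)
    same = np.flatnonzero(w == w[i0])                  # points with the same true label
    other = np.flatnonzero(w != w[i0])                 # points from the other subspaces
    collected = [i0]                                   # I_1^{(1)} = {i0}
    for _ in range(d):
        V, _ = np.linalg.qr(X[:, collected])           # orthonormal basis of V_k
        proj = np.linalg.norm(V.T @ X, axis=0)
        j_star = max((j for j in same if j not in collected), key=lambda j: proj[j])
        if proj[j_star] <= proj[other].max():
            return False                               # FAILURE: NSN would pick a wrong point
        collected.append(j_star)
    return True                                        # SUCCESS
```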

2. In this step, we establish a lower bound on the projections of the correct points. First, note that to compute {||Proj_\mathcal{U}x_j||_2}, one can store the subspace {\mathcal{U}} as a matrix {U} whose columns form an orthonormal basis for {\mathcal{U}}; then {||Proj_\mathcal{U}x_j||_2=||U^Tx_j||_2}. So, let {D_1 \in \mathbb{R}^{D \times d}} be a matrix whose columns form an orthonormal basis for {S_1}, and let {V_k \in \mathbb{R}^{d \times k}} be such that the columns of {D_1V_k} form an orthonormal basis for {\mathcal{V}_k}. Writing {x_{j_k^*} = D_1 y_{j_k^*}}, where {y_{j_k^*} \in \mathbb{R}^d} is the coordinate vector of {x_{j_k^*}} in the basis {D_1}, we have:

\displaystyle ||Proj_{\mathcal{V}_k}x_{j_k^*}||_2=||V_k^TD_1^Tx_{j_k^*}||_2=||V_k^TD_1^TD_1y_{j_k^*}||_2=||V_k^Ty_{j_k^*}||_2
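As a quick numerical sanity check of the identity {||Proj_{\mathcal{U}}x||_2=||U^Tx||_2} used above (a throwaway numpy sketch with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
D, k = 20, 4
U, _ = np.linalg.qr(rng.standard_normal((D, k)))   # orthonormal basis of a k-dim subspace
x = rng.standard_normal(D)

proj = U @ (U.T @ x)                               # orthogonal projection of x onto span(U)
assert np.isclose(np.linalg.norm(proj), np.linalg.norm(U.T @ x))
```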

Hence, {||V_k^Ty_{j_k^*}||_2} is the quantity on which we need to establish the bound. Since it is difficult to analyze directly because of its interdependent terms, we instead analyze a quantity that is stochastically dominated by {||V_k^Ty_{j_k^*}||_2} but is easier to handle. A quantity {A} stochastically dominates a quantity {B} if {Pr\{A \geq t \} \geq Pr\{B \geq t \}} for every {t \geq 0}. We choose this quantity to be {P_{k,(m)}}, defined as the {m}-th largest norm of the projections onto a {k}-dimensional subspace of {n-1} i.i.d. vectors drawn uniformly from the unit sphere {S^{d-1}} in {\mathbb{R}^d}. We have the following lemma on stochastic domination:

Lemma 2 {||V_k^Ty_{j_k^*}||_2} stochastically dominates {P_{k,(k)}}, i.e.,

\displaystyle Pr\{||V_k^Ty_{j_k^*}||_2\geq t \} \geq Pr\{P_{k,(k)} \geq t \}

for any {t \geq 0}

Using the above lemma, we can instead establish a lower bound on {P_{k,(k)}}, which then also lower bounds {||V_k^Ty_{j_k^*}||_2}. We use the following lemma on the order statistics of random projections to bound {P_{k,(m)}}.

Lemma 3 Let {x_1,...,x_n} be i.i.d. points drawn uniformly at random from the unit sphere {S^{d-1}} in {\mathbb{R}^d}. Let {z_{(n-m+1)}} be the {m}-th largest value of {\{z_i=||Ax_i||_2,\ 1 \leq i \leq n \}}, where {A \in \mathbb{R}^{k \times d}}. We have the following bounds on {z_{(n-m+1)}}:

  1. Let the rows of {A} be orthonormal. For any {\alpha \in (0,1)} and {C>0}, if

    \displaystyle n-m+1 \geq Cm \Big (\log \frac{ne}{m\delta}\Big)^2 \ \ \ \ \ (3)

    then we have:

    \displaystyle z_{(n-m+1)}^2 \ > \frac{k}{d} + \frac{1}{d}.\min \Big\{ 2 \log \Big(\frac{n-m+1}{Cm(\log \frac{ne}{m \delta})^2}\Big), \alpha \sqrt{d-k}\Big\} \ \ \ \ \ (4)

    with probability {\geq 1-\delta^m }

  2. For any {k \times d} matrix {A},

    \displaystyle z_{(n-m+1)} \ < \frac{||A||_F}{\sqrt{d}} + \frac{||A||_2}{\sqrt{d}}.\Big(\sqrt{2\pi}+\sqrt{2 \log \frac{ne}{m \delta}}\Big) \ \ \ \ \ (5)

 

Before we apply the lemma to lower bound {||V_k^Ty_{j_k^*}||_2}, we need to show that the condition (3) in the lemma is satisfied by our model. From the first condition in (1) with {C_1=C+1}, we have:

\displaystyle n-d \ > Cd ( \log \frac{ne}{d\delta})^2

and also

\displaystyle n-k \ > Ck ( \log \frac{ne}{k\delta})^2 \ \ \forall k=1...d-1

We use the above conditions and the fact that, for {k>d/2}, {||V_k^Ty_{j_k^*}||_2} stochastically dominates {P_{d/2,(k)}}, together with part 1 of Lemma 3 and the union bound, to establish the following lower bound:

\displaystyle ||V_k^Ty_{j_k^*}||_2^2 \geq \frac{k}{2d} + \frac{1}{d}.\min \Big\{ 2 \log \Big(\frac{n-k+1}{Ck(\log \frac{ne}{k \delta})^2}\Big), \alpha \sqrt{d/2}\Big\} \ \ \ \ \ (6)

{\forall k=1...d-1} simultaneously with probability {\geq 1-\frac{\delta}{1-\delta}}.

3. Next, we establish upper bounds on the projections of the incorrect points. Using the same argument as in the step above, we can write {||Proj_{\mathcal{V}_k}x_j||_2} as {||V_k^TD_1^Tx_j||_2}, which is the quantity we need to bound. From part 2 of Lemma 3, we have the upper bound:

\displaystyle \leq \frac{||V_k^TD_1^T||_F}{\sqrt{D}} + \frac{||V_k^TD_1^T||_2}{\sqrt{D}}.\sqrt{2 \log \frac{n(L-1)e}{ \delta/d}}

Using the fact that {||D_1V_k||_F=\sqrt{k}} and {||D_1V_k||_2 \leq 1} (its columns are orthonormal), we can write the above inequality as:

\displaystyle \leq \sqrt{\frac{k}{D}} + \sqrt{\frac{2}{D} \log \frac{ndLe}{ \delta}} \ \ \ \ \ (7)

Using the union bound, the above holds for every {k=1,...,d-1} simultaneously with probability {\geq 1-\delta }.

4. In the final step, we need to show that (6) {>} (7) for every {k=1,...,d-1}:

\displaystyle \sqrt{\frac{k}{2d} + \frac{1}{d}.\min \Big\{ 2 \log \Big(\frac{n-k+1}{Ck(\log \frac{ne}{k \delta})^2}\Big), \alpha \sqrt{d/2}\Big\}} \ > \ \sqrt{\frac{k}{D}} + \sqrt{\frac{2}{D} \log \frac{ndLe}{ \delta}}

By manipulating terms and using the fact that {n} is polynomial in {d}, it can be shown that the above condition follows from the conditions in (1), which completes the proof.

References

  1. Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. “On the surprising behavior of distance metrics in high dimensional spaces.” ICDT. Vol. 1. 2001.
  2. Park, Dohyung, Constantine Caramanis, and Sujay Sanghavi. “Greedy subspace clustering.” Advances in Neural Information Processing Systems. 2014.
  3. Elhamifar, Ehsan, and René Vidal. “Sparse subspace clustering.” Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
  4. Soltanolkotabi, Mahdi, Ehsan Elhamifar, and Emmanuel J. Candes. “Robust subspace clustering.” The Annals of Statistics 42.2 (2014): 669-699.
  5. Liu, Guangcan, Zhouchen Lin, and Yong Yu. “Robust subspace segmentation by low-rank representation.” Proceedings of the 27th international conference on machine learning (ICML-10). 2010.

 
