# Distributed Probabilistic Learning

1. Abstract

Traditional computer vision algorithms, particularly those that exploit probabilistic and learning-based approaches, are often formulated in centralized settings. However, modern computational settings are increasingly characterized by networks of peer-to-peer connected devices with local data processing abilities. A number of distributed algorithms have been proposed to address problems such as calibration, pose estimation, tracking, and object and activity recognition in large camera networks [1], [2].

One critical challenge in distributed data analysis is dealing with missing data. In camera networks, different nodes have access to only a partial set of data features because of varying camera views or object movement. For instance, object points used for structure from motion (SfM) may be visible only in some cameras and only in particular object poses. As a consequence, nodes are frequently exposed to missing data. However, most current distributed data analysis methods are algebraic in nature and cannot seamlessly handle such missing data.

In this work we present an approach to estimation and learning of generative probabilistic models in a distributed context where certain sensor data can be missing. In particular, we show how traditional centralized models, such as probabilistic PCA (PPCA) [3], missing-data PPCA [4], and Bayesian PCA (BPCA) [4], can be learned when the data is distributed across a network of sensors. We demonstrate the utility of this approach on the problem of distributed affine structure from motion. Our experiments suggest that the accuracy of the learned probabilistic structure and motion models rivals that of traditional centralized factorization methods, while also handling challenging situations such as missing or noisy observations.

2. Distributed Probabilistic Learning

We propose a distributed consensus learning approach for parametric probabilistic models with latent variables that can effectively deal with missing data. The goal of the network of sensors is to learn a single consensus probabilistic model (e.g., 3D object structure) without ever resorting to centralized data pooling or centralized computation. Let $\mathbf{X} = \{ \mathbf{x}_{n} | \mathbf{x}_{n} \in \mathcal{R}^{D} \}$ be a set of iid multivariate data points with the corresponding latent variables $\mathbf{Z} = \{ \mathbf{z}_{n} | \mathbf{z}_{n} \in \mathcal{R}^{M} \}$, $n = 1, \ldots, N$. Our model is a joint density defined on $(\mathbf{x}_{n}, \mathbf{z}_{n})$ with a global parameter $\theta$,

$(\mathbf{x}_{n}, \mathbf{z}_{n}) \sim p(\mathbf{x}_{n}, \mathbf{z}_{n} | \theta),$

with $p(\mathbf{X}, \mathbf{Z} | \theta) = \prod_n p(\mathbf{x}_{n}, \mathbf{z}_{n} | \theta)$, as depicted in Figure 1-a. In this general model, we can find an optimal global parameter $\hat{\theta}$ (in a MAP sense) by applying standard EM learning. It is important to point out that each posterior density estimate at point $n$ depends solely on the corresponding measurement $\mathbf{x}_{n}$ and does not depend on any other $\mathbf{x}_{k}, k \neq n$; this part of the computation is therefore already decentralized. To consider the distributed counterpart of this model, let $G = (V, E)$ be an undirected connected graph with vertices $i, j \in V$ and edges $e_{ij} = (i, j) \in E$ connecting two vertices. The $i$-th node is directly connected to its 1-hop neighbors in $\mathcal{B}_{i} = \{ j; e_{ij} \in E \}$. Suppose the set of data samples at the $i$-th node is $\mathbf{X}_{i} = \{ \mathbf{x}_{in}; n = 1, \ldots , N_{i} \}$, where $\mathbf{x}_{in} \in \mathcal{R}^{D}$ is the $n$-th measurement vector and $N_{i}$ is the number of samples collected at the $i$-th node. Likewise, we define the latent variable set for node $i$ as $\mathbf{Z}_{i} = \{ \mathbf{z}_{in}; n = 1, \ldots , N_{i} \}$.

Learning the model parameter would be decentralized if each node had its own independent parameter $\theta_i$. Still, the centralized model can be equivalently defined using the set of local parameters, with an additional constraint on their consensus, $\theta_1 = \theta_2 = \cdots = \theta_{|V|}$. This is illustrated in Figure 1-b where the local node models are constrained using ties defined on the underlying graph. The simple consensus tying can be more conveniently defined using a set of auxiliary variables $\rho_{ij}$, one for each edge $e_{ij}$ (Figure 1-c). This now leads to the final distributed consensus learning formulation, similar to [5]:

\begin{align*} \label{dpm_opt1} \hat{\mathbf{\theta}} = \arg\min_{ \{ \theta_{i} : i \in V \} } & -\log p( \mathbf{X} | \mathbf{\theta}, G) \\ \text{s.t.} &\quad \theta_{i} = \rho_{ij}, \; \rho_{ij} = \theta_{j}, \; i \in V, \; j \in \mathcal{B}_{i}. \end{align*}

This is a constrained optimization task that can be solved in a principled manner using the Alternating Direction Method of Multipliers (ADMM) [6]. ADMM iteratively, in a block-coordinate fashion, solves $\max_{\lambda} \min_{\theta} \mathcal{L}(\cdot)$ on the augmented Lagrangian

\begin{align*} \label{dpm_opt2} \mathcal{L}( \mathbf{\theta}, \rho, \lambda ) &= -\log p( \mathbf{X} | \theta_{1}, \theta_{2}, \ldots , \theta_{|V|}, G) \\ &\quad + \sum_{i \in V} \sum_{j \in \mathcal{B}_{i}} \left\{ \lambda_{ij1}^{\text{T}} ( \theta_{i} - \rho_{ij} ) + \lambda_{ij2}^{\text{T}} ( \rho_{ij} - \theta_{j} ) \right\} \\ &\quad + \frac{ \eta }{ 2 } \sum_{i \in V} \sum_{j \in \mathcal{B}_{i}} \left\{ || \theta_{i} - \rho_{ij} ||^{2} + || \rho_{ij} - \theta_{j} ||^{2} \right\} \end{align*}

where $\lambda_{ij1}, \lambda_{ij2}, i,j \in V$ are the Lagrange multipliers, $\eta$ is a positive scalar penalty parameter, and $||\cdot||$ is the induced norm. The last term (scaled by $\eta$) is not strictly necessary for consensus but introduces additional regularization.
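To make the consensus mechanism concrete, the following is a minimal sketch of edge-based consensus ADMM on a deliberately simple model: each node estimates the scalar mean of a unit-variance Gaussian, so the local negative log-likelihood is quadratic and has no latent variables. It assumes the common simplification in which the auxiliary variables $\rho_{ij}$ and the paired multipliers are eliminated, leaving one multiplier per node, with all multipliers initialized to zero. The function name `consensus_admm_mean`, the ring topology, and the data are illustrative assumptions, and this is not the D-PPCA update itself.

```python
def consensus_admm_mean(node_data, neighbors, eta=1.0, num_iters=500):
    """Decentralized estimate of a shared Gaussian mean (unit variance assumed).

    node_data[i] : list of scalar samples held by node i
    neighbors[i] : list of 1-hop neighbors of node i (undirected graph)
    """
    num_nodes = len(node_data)
    # Start each node at its own local maximum-likelihood estimate.
    theta = [sum(x) / len(x) for x in node_data]
    # One aggregated multiplier per node, initialized to zero.
    lam = [0.0] * num_nodes

    for _ in range(num_iters):
        new_theta = []
        for i in range(num_nodes):
            deg = len(neighbors[i])
            nbr_sum = sum(theta[j] for j in neighbors[i])
            # Closed-form minimizer of
            #   0.5 * sum_n (x_in - t)^2 + 2 * lam_i * t
            #   + eta * sum_j (t - (theta_i + theta_j) / 2)^2
            numer = sum(node_data[i]) - 2.0 * lam[i] + eta * (deg * theta[i] + nbr_sum)
            denom = len(node_data[i]) + 2.0 * eta * deg
            new_theta.append(numer / denom)
        theta = new_theta
        # Multiplier ascent step: uses only the neighbors' current estimates.
        for i in range(num_nodes):
            lam[i] += 0.5 * eta * sum(theta[i] - theta[j] for j in neighbors[i])
    return theta


# Four nodes on a ring; no node ever sees another node's raw samples.
node_data = [[1.0, 2.0, 3.0], [2.0, 4.0], [5.0], [0.0, 1.0, 1.0, 3.0]]
neighbors = [[1, 3], [0, 2], [1, 3], [2, 0]]
theta = consensus_admm_mean(node_data, neighbors)
global_mean = sum(sum(x) for x in node_data) / sum(len(x) for x in node_data)
```

In this sketch the nodes exchange only their current parameter estimates with 1-hop neighbors, and every local estimate approaches the mean that a centralized computation over the pooled samples would give.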

3. Distributed Probabilistic Principal Component Analysis (D-PPCA)

Distributed versions of PPCA and missing-data PPCA can be derived directly from the model above. Details, including the derivation of the iterative update formulas for distributed EM [5], can be found in Yoon and Pavlovic (2012).
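As a point of reference for what is being distributed, the centralized EM iteration for complete-data PPCA [3] can be sketched as below. This is the standard textbook form, not the distributed algorithm, and it does not handle missing entries. The function name `ppca_em`, the synthetic loading matrix `W_true`, and the noise level are assumptions chosen for illustration.

```python
import numpy as np


def ppca_em(X, M, num_iters=200, seed=0):
    """Centralized EM for PPCA on complete data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)              # ML estimate of the mean
    Xc = X - mu
    W = rng.standard_normal((D, M))  # random initial loading matrix
    sigma2 = 1.0                     # initial isotropic noise variance

    for _ in range(num_iters):
        # E-step: the posterior moments of z_n depend only on x_n.
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv                         # row n is E[z_n]
        sum_Ezz = N * sigma2 * Minv + Ez.T @ Ez    # sum_n E[z_n z_n^T]

        # M-step: update the global parameters from pooled statistics.
        W = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        sigma2 = (
            np.sum(Xc ** 2)
            - 2.0 * np.sum(Ez * (Xc @ W))
            + np.trace(sum_Ezz @ W.T @ W)
        ) / (N * D)
    return W, mu, sigma2


# Synthetic check: 5-D observations generated from a 2-D latent space.
rng = np.random.default_rng(1)
W_true = np.array([[2.0, 0.0], [0.0, 1.5], [1.0, 1.0], [0.0, 0.0], [0.5, -0.5]])
Z = rng.standard_normal((400, 2))
X = Z @ W_true.T + 0.1 * rng.standard_normal((400, 5))
W_hat, mu_hat, sigma2_hat = ppca_em(X, M=2)
```

The M-step is the part that needs statistics pooled over all samples, which is where a consensus constraint on per-node copies of the parameters would enter in a distributed variant.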

We tested D-PPCA on synthetic Gaussian data, including cases where some of the values are missing. As Figure 2 shows, D-PPCA performed comparably to its centralized counterpart regardless of whether data were missing, either missing-at-random (MAR) or missing-not-at-random (MNAR). We also report an empirical convergence analysis in the supplementary material.

We also applied D-PPCA to the problem of distributed affine SfM. We conducted experiments in both synthetic (multiple cameras observing a rotating cube) and real settings. For the real settings, we used videos obtained from Caltech [7] and Johns Hopkins [8]. We simulated a multiple-camera setting by dividing the frames sequentially among the cameras, 5 in our case; for example, with 5 cameras and 30 frames in total, frames 1-6 are assigned to camera 1, frames 7-12 to camera 2, and so on. We compared the structure reconstructed by D-PPCA with the structure reconstructed by the centralized, SVD-based method using the subspace angle. Table 1 below shows the results we obtained on the Caltech turntable dataset. They indicate that the accuracy of D-PPCA rivals that of the SVD-based method even when some observation values are missing. For detailed and additional results and explanation, please refer to the attached manuscript and supplementary materials.
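The evaluation protocol described above can be sketched as follows: a contiguous split of frames among cameras, and a subspace-angle comparison between two reconstructions. The helper names `assign_frames` and `largest_subspace_angle` are hypothetical, the sketch assumes each matrix's columns span the subspace being compared, and it reports the largest principal angle; the exact convention used in the experiments is not stated here, so treat this as one reasonable reading.

```python
import numpy as np


def assign_frames(num_frames, num_cameras):
    """Split frame indices 1..num_frames into contiguous equal-sized blocks.

    Assumes num_frames is divisible by num_cameras; leftover frames are dropped.
    """
    per_camera = num_frames // num_cameras
    return [
        list(range(c * per_camera + 1, (c + 1) * per_camera + 1))
        for c in range(num_cameras)
    ]


def largest_subspace_angle(A, B):
    """Largest principal angle (radians) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)   # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)   # orthonormal basis of span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.max(np.arccos(np.clip(cosines, -1.0, 1.0))))


cameras = assign_frames(30, 5)   # camera 1 gets frames 1-6, camera 2 gets 7-12, ...

# The angle ignores any invertible change of basis, which suits affine
# reconstructions that are only defined up to such a transformation.
rng = np.random.default_rng(0)
S = rng.standard_normal((8, 3))
R = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 3.0]])
angle_same = largest_subspace_angle(S, S @ R)
angle_orth = largest_subspace_angle(np.eye(3)[:, :1], np.eye(3)[:, 1:2])
```

An angle near zero means the two reconstructions span the same subspace, while orthogonal subspaces give an angle of $\pi/2$.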

4. Distributed Bayesian Principal Component Analysis (D-BPCA)

We can also apply a similar framework to obtain a distributed extension of mean field variational inference (MFVI). It is easy to show that the following distributed formulation is equivalent to the centralized mean field variational inference optimization problem:

\begin{align*}\label{ooo} [\hat{\lambda}_Z, \hat{\lambda}_W] = &\underset{\lambda_{Z_i}, \lambda_{W_i}: i \in V}{\arg\min}\; - \mathbb{E}_Q\big[ \log P(X,Z,W|\Omega_z,\Omega_w) \big] + \mathbb{E}_Q[\log Q] \\ &\text{s.t.} \;\; \lambda_{W_i} = \rho_{ij},\;\;\rho_{ij} = \lambda_{W_j},\;\; i\in V, j\in \mathcal{B}_i \end{align*}

where $Z=\{z_n \in \mathbb{R}^{M}\}_{n=1}^{N}$ denotes the set of local latent variables, $W$ denotes a global latent variable, and $\Omega=[\Omega_z, \Omega_w]$ denotes a set of fixed parameters. The forms of $Q(z_n;\lambda_{z_n})$ and $Q(W;\lambda_{W})$ are chosen from the same exponential family as the corresponding conditional distributions (e.g., $P(W|X,Z,\Omega_w)$). With conjugate exponential-family priors and likelihoods, each coordinate descent update in MFVI can be done in closed form. However, the ADMM penalty terms are quadratic in the difference $(\lambda_{W_i} - \rho_{ij})$, which may make the updates for $\{\lambda_{W_i}\}_{i=1}^{|V|}$ non-analytic.

To solve the above problem efficiently, we propose to use Bregman ADMM (B-ADMM) [9], which generalizes ADMM by replacing the quadratic penalty term with a Bregman divergence chosen to exploit the structure of the problem. We propose to use the log partition function of the global parameter as the Bregman function. With this choice, the B-ADMM updates can be obtained analytically, in closed form. The figure below compares performance on the distributed affine structure from motion experiment described above, using the centralized versions of SVD, PPCA, and BPCA (with variational inference) and the proposed distributed versions of PPCA and BPCA under varying noise levels.
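One way to build intuition for this choice is the standard identity that the Bregman divergence generated by an exponential family's log partition function equals a KL divergence between two members of that family. The sketch below checks the identity numerically for a Bernoulli distribution, which is used purely as an illustrative assumption and is not the model from this work; the function names are hypothetical.

```python
import math


def log_partition(eta):
    """Bernoulli log partition function A(eta) = log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))


def mean_param(eta):
    """Derivative of A: the Bernoulli mean, sigmoid(eta)."""
    return 1.0 / (1.0 + math.exp(-eta))


def bregman(eta_a, eta_b):
    """B_A(eta_a, eta_b) = A(eta_a) - A(eta_b) - A'(eta_b) * (eta_a - eta_b)."""
    return (
        log_partition(eta_a)
        - log_partition(eta_b)
        - mean_param(eta_b) * (eta_a - eta_b)
    )


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


eta_a, eta_b = 0.3, -1.2
b_val = bregman(eta_a, eta_b)
# Note the argument order: B_A(eta_a, eta_b) matches KL(p_b || p_a).
kl_val = kl_bernoulli(mean_param(eta_b), mean_param(eta_a))
```

Unlike a squared Euclidean penalty on natural parameters, a penalty of this form is expressed through the same log partition function that appears in the exponential-family densities themselves.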

Related Publications

• S. Yoon and V. Pavlovic, “Decentralized Probabilistic Learning For Sensor Networks,” in IEEE Global Conference on Signal and Information Processing, 2016.
@InProceedings{yoon16gsip,
author = {Sejong Yoon and Vladimir Pavlovic},
title = {Decentralized Probabilistic Learning For Sensor Networks},
booktitle = {IEEE Global Conference on Signal and Information Processing},
year = {2016},
month = dec,
note = {50\% contribution.},
date-modified = {2016-09-11 21:35:57 +0000},
}

• C. Song, S. Yoon, and V. Pavlovic, “Fast ADMM Algorithm for Distributed Optimization with Adaptive Penalty,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, 2016, p. 753–759.
@InProceedings{yoon16aaai,
author = {Changkyu Song and Sejong Yoon and Vladimir Pavlovic},
booktitle = {Proceedings of the Thirtieth {AAAI} Conference on Artificial Intelligence},
year = {2016},
pages = {753--759},
month = feb,
note = {33\% contribution},
bdsk-url-1 = {http://arxiv.org/abs/1506.08928},
date-modified = {2016-09-11 21:16:22 +0000},
url = {http://arxiv.org/abs/1506.08928},
}

• B. Babagholami, S. Yoon, and V. Pavlovic, “D-MFVI: Distributed Mean Field Variational Inference using Bregman ADMM,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, 2016, p. 1582–158.
@InProceedings{behnam16aaai,
author = {Behnam Babagholami and Sejong Yoon and Vladimir Pavlovic},
title = {{D-MFVI}: Distributed Mean Field Variational Inference using Bregman {ADMM}},
booktitle = {Proceedings of the Thirtieth {AAAI} Conference on Artificial Intelligence},
year = {2016},
pages = {1582--158},
month = feb,
note = {33\% contribution},
bdsk-url-1 = {http://arxiv.org/abs/1507.00824},
date-modified = {2016-09-11 21:16:12 +0000},
url = {http://arxiv.org/abs/1507.00824},
}

• S. Yoon and V. Pavlovic, “Distributed Probabilistic Learning for Camera Networks with Missing Data,” in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., 2012, p. 2933–2941.
@InProceedings{yoon12nips,
author = {Sejong Yoon and Vladimir Pavlovic},
title = {Distributed Probabilistic Learning for Camera Networks with Missing Data},
booktitle = {Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States.},
year = {2012},
pages = {2933--2941},
note = {50\% contribution},
bdsk-url-1 = {http://papers.nips.cc/paper/4629-distributed-probabilistic-learning-for-camera-networks-with-missing-data},
url = {http://papers.nips.cc/paper/4629-distributed-probabilistic-learning-for-camera-networks-with-missing-data},
}

References

1. R. J. Radke. “A Survey of Distributed Computer Vision Algorithms”. H. Nakashima, H. Aghajan and J. C. Augusto, eds. Springer Science+Business Media, LLC. 2010.
2. R. Tron and R. Vidal. “Distributed Computer Vision Algorithms”, IEEE Signal Processing Magazine, Vol. 28. 2011, pp. 32-45.
3. M. E. Tipping and C. M. Bishop. “Probabilistic Principal Component Analysis”, Journal of the Royal Statistical Society: Series B, Vol. 61. 1999, pp. 611-622.
4. A. Ilin and T. Raiko. “Practical Approaches to Principal Component Analysis in the Presence of Missing Values”, Journal of Machine Learning Research, Vol. 11. 2010, pp. 1957-2000.
5. P. A. Forero, A. Cano and G. B. Giannakis. “Distributed clustering using wireless sensor networks”, IEEE Journal of Selected Topics in Signal Processing, Vol. 5, August, 2011, pp. 707-724.
6. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, vol. 3, Now Publishers, 2011.
7. P. Moreels and P. Perona. “Evaluation of Features Detectors and Descriptors based on 3D Objects”, International Journal of Computer Vision, Vol. 73. 2007, pp. 263-284.
8. R. Tron and R. Vidal. “A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms”, IEEE International Conference on Computer Vision and Pattern Recognition. 2007.
9. H. Wang and A. Banerjee, “Bregman Alternating Direction Method of Multipliers”, Advances in Neural Information Processing Systems 27, 2014.