Depth Recovery With Face Priors

*Chongyu Chen was with Nanyang Technological University*

1. Abstract.

Existing depth recovery methods for commodity RGB-D sensors primarily rely on low-level information for repairing the measured depth estimates. However, as the distance of the scene from the camera increases, the recovered depth estimates become increasingly unreliable. The human face is often a primary subject in the captured RGB-D data in applications such as video conferencing. In this work, we propose to incorporate face priors extracted from a general sparse 3D face model into the depth recovery process. In particular, we propose a joint optimization framework that consists of two main steps: deforming the face model for better alignment and applying face priors for improved depth recovery. The two steps alternate iteratively so that each improves the other. Evaluations on benchmark datasets demonstrate that the proposed method with face priors significantly outperforms the baseline method that does not use face priors, with up to 15.1% improvement in depth recovery quality and up to 22.3% in registration accuracy.

2. The proposed method.

Given a color image I and its corresponding (aligned) noisy depth map Z as input, our goal is to obtain a good depth map of the face region using the face priors derived from the general 3D deformable model. The pipeline of the proposed method is shown below.

The first two components are pre-processing steps that roughly clean up the depth data and roughly align the general face model to the input point cloud. The last two components are the core of our proposed framework. In the guided depth recovery component, we fix the face prior and use it to update the depth, while in the last component, we fix the depth and update the face prior. The last two components alternate iteratively until convergence.

Based on [1], the energy function is formulated as
\[\min\limits_{U, u} E_r(U) + \lambda_d E_d(U) + \lambda_f E_f(U, u)\]

The first two terms are similar to [1]. The last term is the new face prior term, which is defined as follows:
\[E_f(U, u) = \sum_{i \in \Omega_f} \eta_i \left( U(i) - T_f( P(u), i) \right)^2\]

U represents the depth map to be recovered, u represents the parameters of the 3D deformable face model, and T_f is the facial shape transformation function according to u. For more details on the deformation of the face model, please refer to our previous project [2].

Considering that the guidance from the sparse vertices of the Candide model may be too weak to serve as the prior for the full (dense) depth map U, we need to generate a dense synthetic depth map Y from the aligned face prior P(u) using an interpolation process. It is possible to define different interpolation functions according to desired dense surface properties. In computer graphics, such models may use non-uniform rational basis spline (NURBS) to guarantee surface smoothness. Here, for the purpose of a shape prior we choose a simple piece-wise linear interpolation. This process is denoted as
\[Y = \text{lerp}( P(u) ),\]
which is demonstrated in the figure.

To mitigate the effects of the piece-wise flat dense patches due to the linear interpolation, we introduce a weighting scheme defined through weights η_i. In particular, for each pixel Y(i), we use a normalized weight that is adaptive to the pixel’s distances from the neighboring vertices of the sparse shape P. Let (a_i, b_i, c_i) be the barycentric coordinates of pixel i inside the triangle defined by its three neighboring vertices of P. Then, its weight is computed as
\[\eta_i = \sqrt{a_i^2+b_i^2+c_i^2}, \ i \in \Omega_f.\]
This implies that the pixels corresponding to model vertices have the highest weight of $1$, while the weights decline towards the center of each triangular patch, reaching $1/\sqrt{3} \approx 0.58$ at the centroid. An illustration of the weights is given in the figure below, where bright pixels represent large weights.
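As a minimal sketch of the weighting scheme above, the following computes η_i for a pixel inside one triangle of the sparse model. The function names and the 2D example triangle are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def barycentric_coords(p, v0, v1, v2):
    """Barycentric coordinates (a, b, c) of 2D point p in triangle (v0, v1, v2)."""
    p, v0, v1, v2 = (np.asarray(x, dtype=float) for x in (p, v0, v1, v2))
    # Solve p - v2 = a*(v0 - v2) + b*(v1 - v2); c follows from a + b + c = 1.
    m = np.column_stack((v0 - v2, v1 - v2))
    a, b = np.linalg.solve(m, p - v2)
    return a, b, 1.0 - a - b

def prior_weight(p, v0, v1, v2):
    """Weight eta = sqrt(a^2 + b^2 + c^2) of a pixel inside the triangle."""
    a, b, c = barycentric_coords(p, v0, v1, v2)
    return float(np.sqrt(a * a + b * b + c * c))

# Hypothetical triangle in pixel coordinates.
tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
w_vertex = prior_weight(tri[0], *tri)             # pixel on a model vertex
w_centroid = prior_weight((4 / 3, 4 / 3), *tri)   # pixel at the patch center
```

The weight is largest on a vertex, where the interpolated depth is an actual model value, and smallest at the centroid, where the piece-wise linear patch is least trustworthy.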

3. Energy optimization.

From the definition of the energy function, it can be seen that the optimization of U remains a convex task for a given fixed prior P. However, the optimization of the face model parameter set u might not be convex, since it involves rigid and non-rigid deformation. Therefore, to tackle the global optimization task, which includes recovering both the depth U and the deformation u, we resort to a standard alternating optimization process. In other words, we first optimize u while keeping U fixed, and then optimize U for the fixed deformation u. Specifically, we divide our problem into three well-studied subproblems: depth recovery, rigid registration, and non-rigid deformation. The algorithm is detailed below:
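The alternation can be illustrated with a deliberately simplified toy problem. The sketch below is not the proposed algorithm: it assumes a plain quadratic data term in place of E_r and E_d, and a single scalar offset u in place of the rigid and non-rigid face deformation. All names are hypothetical.

```python
import numpy as np

def energy(U, u, Z, template, eta, lam):
    """Toy energy: quadratic data term plus weighted face prior term."""
    prior = template + u  # stand-in for the dense prior Y generated from P(u)
    return float(np.sum((U - Z) ** 2) + lam * np.sum(eta * (U - prior) ** 2))

def alternate(Z, template, eta, lam=2.0, iters=20):
    U, u = Z.copy(), 0.0
    history = [energy(U, u, Z, template, eta, lam)]
    for _ in range(iters):
        # Step 1: fix U, update the model parameter u (weighted least squares).
        u = float(np.sum(eta * (U - template)) / np.sum(eta))
        # Step 2: fix u, update the depth U (per-pixel closed form, convex).
        prior = template + u
        U = (Z + lam * eta * prior) / (1.0 + lam * eta)
        history.append(energy(U, u, Z, template, eta, lam))
    return U, u, history

rng = np.random.default_rng(0)
template = np.linspace(0.0, 1.0, 50)                 # toy model profile
Z = template + 3.0 + rng.normal(0.0, 0.2, size=50)   # noisy, shifted measurement
eta = np.full(50, 0.8)
U, u, history = alternate(Z, template, eta)
```

Because each step minimizes the same energy over one block of variables, the energy recorded in `history` never increases, which is the property that makes alternating schemes of this kind well behaved.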

4. Experiments.
a. Synthetic data

We first use the BU4D Facial Expression Database [3] for quantitative evaluation. Considering that Kinect is the most popular commodity RGB-D sensor, we add some Kinect-like artifacts to the depth maps generated from the BU4D database.

We rendered the BU4DFE data at different distances: 1.2m, 1.5m, 1.75m and 2m. By using synthetic data, we are able to obtain the ground truth for quantitative evaluation. Specifically, we measure the depth recovery performance in terms of the average pixel-wise Mean Absolute Error (MAE) in mm. The following plot compares the results of our method to the baseline [1].
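The error measure can be sketched as follows. This assumes depth maps stored in millimeters and a boolean mask marking the face region; the function name and the toy values are illustrative only.

```python
import numpy as np

def masked_mae_mm(recovered, ground_truth, mask):
    """Mean absolute depth error in mm over the pixels selected by mask."""
    diff = np.abs(recovered[mask] - ground_truth[mask])
    return float(diff.mean())

# Toy example: a flat surface at 1.5 m with two erroneous pixels.
gt = np.full((4, 4), 1500.0)
rec = gt.copy()
rec[1, 1] += 8.0
rec[2, 2] -= 4.0
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True          # evaluate on the 2x2 center region only
mae = masked_mae_mm(rec, gt, mask)
```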

Besides the recovery error, we also evaluate the registration accuracy. To get the reference registration and shapes, we fit the 3D face model to noise-free data. The face model is also fitted to the depth maps obtained by the different methods. We then compare each fitting result with the reference registration. Table 1 shows that the proposed method produces a more accurate face registration than the baseline method, especially in the eye region and around the face boundary.

Samples at 1.75m

Samples at 2m

Blendshape-based 3D Face Tracking

A sample result of our 3D face tracker, where (a) shows the 3D landmarks projected onto the image plane, (b,c) show the 3D blendshape model and the input point cloud, and (d) shows the skinned 3D shape.

1. Abstract.

We introduce a novel robust hybrid 3D face tracking framework for RGBD video streams, which is capable of tracking head pose and facial actions without pre-calibration or intervention from a user. In particular, we emphasize improving the tracking performance in instances where the tracked subject is at a large distance from the cameras and the quality of the point cloud deteriorates severely. This is accomplished by the combination of a flexible 3D shape regressor and a joint 2D+3D optimization on shape parameters. Our approach fits facial blendshapes to the point cloud of the human head, while being driven by an efficient and rapid 3D shape regressor trained on generic RGB datasets. As an on-line tracking system, the identity of the unknown user is adapted on-the-fly, resulting in improved 3D model reconstruction and consequently better tracking performance. The result is a robust RGBD face tracker, capable of handling a wide range of target scene depths, beyond those that can be afforded by traditional depth or RGB face trackers. Lastly, since the blendshape is not able to accurately recover the real facial shape, we use the tracked 3D face model as a prior in a novel filtering process to further refine the depth map for use in other tasks, such as 3D reconstruction.

2. The tracking framework.

In this work, we use the blendshape model from the FaceWarehouse database.

The figure shows the pipeline of the proposed face tracking framework, which follows a coarse-to-fine multi-stage optimization design. In particular, our framework consists of two major stages: shape regression and shape refinement. The shape regressor performs the first optimization stage, which is learned from training data, to quickly estimate shape parameters from the RGB frame. Then, in the second stage, a carefully designed optimization is performed on both the 2D image and the available 3D point cloud data to refine the shape parameters, and finally the identity parameters are updated to improve shape fitting to the input RGBD data.

The 3D shape regressor is the key component for achieving our goal of 3D tracking at large distances, where the quality of the depth map is often poor. Unlike existing RGBD-based face tracking works, which either rely heavily on an accurate input point cloud (at close distances) to model shape transformation by ICP or use an off-the-shelf 2D face tracker to guide the shape transformation, we predict the 3D shape parameters directly from the RGB frame with the developed 3D regressor. This is motivated by the success of 3D shape regression from RGB images. The approach is especially meaningful for the large-distance scenarios we consider, where the depth quality is poor. Thus, we do not make use of the depth information in the 3D shape regression, to avoid propagating inaccuracies from the depth map.

Initially, a color frame I is passed through the regressor to recover the shape parameters θ. The projection of the N_l landmark vertices of the 3D shape onto the image plane typically does not accurately match the 2D landmarks annotated in the training data. We therefore include 2D displacements D in the parameter set and define a new global shape parameter set P = ({θ},D) = (R,T,e,D). The advantages of including D in P are two-fold. First, it helps train the regressor to reproduce landmarks in the test image similar to those in the training set. Second, it prepares the regressor to work with unseen identities that do not appear in the training set; in such cases the displacements D may be large to compensate for the difference in identities. The regression process can be expressed as \[P^{out} = f_{r}(I,P^{in}),\] where f_r is the regression function, I is the current frame, and P^{in} and P^{out} are the input (from the shape regression for the previous frame) and output shape parameter sets, respectively. The coarse estimates P^{out} are refined further in the next stage, using a more precise energy optimization that incorporates depth information. Specifically, \[ \theta = (R,T,e)\] are optimized w.r.t. both the 2D prior constraints provided by the 2D landmarks estimated by the shape regressor and the 3D point cloud. Lastly, the identity vector w_id is re-estimated given the current transformation. (For more details, please refer to our manuscript on arXiv.)
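The role of the 2D displacements D can be sketched as follows. This is an illustration under stated assumptions, not the tracker's code: it assumes a pinhole camera with a hypothetical focal length and principal point, and takes the 3D landmark positions X (which in the tracker would come from the blendshape model under expression e) as given.

```python
import numpy as np

def project_landmarks(X, R, T, D, focal=525.0, center=(319.5, 239.5)):
    """Project 3D landmarks X (N x 3) and add per-landmark 2D displacements D (N x 2).

    focal and center are assumed pinhole intrinsics, chosen for illustration.
    """
    cam = X @ R.T + T                                  # rigid transform (R, T)
    uv = focal * cam[:, :2] / cam[:, 2:3] + np.asarray(center)
    return uv + D                                      # D absorbs the residual 2D mismatch

X = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])      # two toy landmarks, in meters
R = np.eye(3)
T = np.array([0.0, 0.0, 2.0])                          # subject 2 m from the camera
D = np.zeros((2, 2))
uv = project_landmarks(X, R, T, D)
```

With D set to zero the output is the pure projection of the model; a non-zero D shifts each projected landmark towards its annotated 2D position.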

The effect of using depth data for regularization: (a,b) without depth data; (c,d) with depth data

Identity adaptation:

3. Tracking results.

– On BU4DFE dataset

– On real RGBD sequences

with occlusion:

4. Depth recovery using dense shape priors.

Based on our previous work, we replace the sparse Candide face model with a blendshape model and formulate the depth recovery process as a filter on the depth map.

A sample result on real data at 2m: (a) the prior, (b) the raw depth data, (c) filtered without prior, (d) filtered with prior.


  • H. X. Pham, C. Chen, L. N. Dao, V. Pavlovic, J. Cai and T.-J. Cham. “Robust Performance-driven 3D Face Tracking in Long Range Depth Scenes”. 2015

Multi-Cue Structure Preserving MRF for Unconstrained Video Segmentation


Video segmentation is a stepping stone to understanding video context. Video segmentation enables one to represent a video by decomposing it into coherent regions which comprise whole objects or parts of objects. However, the challenge originates from the fact that most video segmentation algorithms are based on unsupervised learning, due to the high cost of pixelwise video annotation and the intra-class variability within similar unconstrained video classes. We propose a Markov Random Field model for unconstrained video segmentation that relies on tight integration of multiple cues: vertices are defined from contour based superpixels, unary potentials from temporally smooth label likelihood, and pairwise potentials from the global structure of a video. Multi-cue structure is a breakthrough in extracting coherent object regions for unconstrained videos in the absence of supervision. Our experiments on the VSB100 dataset show that the proposed model significantly outperforms competing state-of-the-art algorithms. Qualitative analysis illustrates that the video segmentation result of the proposed model is consistent with human perception of objects.

1  Overview

Figure 1: Overview of the framework. (a) Node potential depends on histogram of temporal smooth pixelwise labels of the corresponding frame. Spatial edge potentials: (b) Gray intensity represents contour strength. (c) RGB color is displayed for better visualization. (d) Color represents motion direction. (e) Color represents visual word identity of each dense SIFT feature. Temporal edge potential depends on correspondence ratio on long trajectory and color affinity. (f) Superpixels for corresponding vertices in the frame f are illustrated by object contours. For visualization purpose, it shows coarse grained superpixels. Best viewed in color.


2  Contributions

In this paper, we propose a novel hierarchical video segmentation model which integrates temporally smooth labels with global structure consistency while preserving object boundaries. Our contributions are as follows:

•     We propose a video segmentation model that preserves multi-cue structures of object boundary and temporal smooth label with global spatio-temporal consistency.

•     We propose an effective pairwise potential to represent spatio-temporal structure evaluated on object boundary, color, optical flow, texture and long trajectory correspondence.

•     Video hierarchy is inferred through the process of graph edge consistency, which generalizes traditional hierarchy induction approaches.

•     The proposed method infers precise coarse grained segmentation, where a segment may represent one whole object.


3  Proposed Model

3.1  Multi-Cue Structure Preserving MRF Model

An overview of our framework for video segmentation is depicted in Figure 1. A video is represented as a graph G=(V,E), where the vertex set V is defined on contour based superpixels from all frames f∈{1,⋯,F} in the video. For each frame, an object contour map is obtained from the contour detector [1]. A region enclosed by a contour forms a superpixel. The edge set E describes the relationship for each pair of vertices. It consists of spatial edges, which connect vertices within the same frame, and temporal edges, which connect vertices in different frames.

Video segmentation is obtained by MAP inference on a Markov Random Field defined on this graph G, whose distribution is normalized by the partition function Z. Each vertex i is assigned a label from the label set L. MAP inference is equivalent to the following energy minimization problem.

In (1), node potentials are defined for each vertex i∈V and edge potentials for each edge in E. As with the edge set E, edge potentials are decomposed into spatial and temporal edge potentials. The label of a vertex is encoded as a label indicator vector, and the labels of a pair of vertices as a label pair indicator matrix. Operators ⋅ and : represent the inner product and the Frobenius product, respectively. Spatial edge potentials are defined for each edge that connects vertices in the same frame. In contrast, temporal edge potentials are defined for each pair of vertices in different frames. It is worth noting that the proposed model includes spatial edges between two vertices that are not spatially adjacent and, similarly, temporal edges are not limited to consecutive frames.

The set of vertices of the graph is defined from contour based superpixels such that the inferred region labels will preserve accurate object boundaries. Node potential parameters are obtained from temporally smooth label likelihood. Edge potential parameters aggregate appearance and motion features to represent the global spatio-temporal structure of the video. MAP inference on the proposed Markov Random Field (MRF) model will infer region labels which preserve object boundaries, attain temporal smoothness and are consistent with the global structure. Details are described in the following sections.
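To make the structure of the energy concrete, the sketch below evaluates a labeling under per-vertex node costs and Potts-style pairwise costs, and finds the MAP labeling of a three-vertex toy graph by exhaustive search. The function and variable names are hypothetical, and brute force is used here only because the toy graph is tiny; it is a stand-in for the FastPD solver used in the experiments.

```python
import itertools
import numpy as np

def mrf_energy(labels, node_cost, edges, edge_weight):
    """Sum of node costs plus a Potts penalty for every edge with differing labels."""
    unary = sum(node_cost[i, l] for i, l in enumerate(labels))
    pairwise = sum(w for (i, j), w in zip(edges, edge_weight)
                   if labels[i] != labels[j])
    return float(unary + pairwise)

# Toy graph: 3 vertices, 2 labels, a strong edge (0,1) and a weak edge (1,2).
node_cost = np.array([[0.0, 1.0],
                      [0.6, 0.4],
                      [1.0, 0.0]])
edges = [(0, 1), (1, 2)]
edge_weight = [0.5, 0.1]

candidates = itertools.product(range(2), repeat=3)
best = min(candidates, key=lambda y: mrf_energy(y, node_cost, edges, edge_weight))
best_energy = mrf_energy(best, node_cost, edges, edge_weight)
```

In this toy case vertex 1 slightly prefers label 1 on its own, but the strong edge to vertex 0 pulls it to label 0, which is how pairwise structure overrides weak unary evidence.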

3.2  Node Potentials

Unary potential parameters represent the cost of assigning a vertex i∈V a label from the label set L. While edge potentials represent the global spatio-temporal structure in a video, node potentials in the proposed model strengthen temporal smoothness for label inference. The temporally smooth label set L is obtained from greedy agglomerative clustering [10]. The clustering algorithm merges two adjacent blobs in a video when their color difference is smaller than the variance of each blob. Node potential parameters represent the labeling cost of vertex i as the negative label likelihood.

Each superpixel is evaluated by the pixelwise cluster labels from L, and the label histogram represents the label likelihood for the vertex i. As illustrated in Figure 1 (a), a superpixel has a mixture of pixelwise temporally smooth labels because the agglomerative clustering [10] merges unstructured blobs. For each label b, we count the number of pixels with temporally smooth label b in the superpixel corresponding to vertex i. As described in Section 3.1, a vertex is defined on a superpixel which is enclosed by an object contour. Arbelaez et al. [1] extract object contours such that taking different threshold values on the contours produces different granularity levels of enclosed regions. In our proposed model, we take the set of vertices of a video frame f from a single threshold on contours, which results in fine-grained superpixels.
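A minimal sketch of this construction is given below. The exact form of the node potential is not reproduced in this text, so the sketch assumes the simplest reading: the cost of a label is the negative of its normalized frequency among the pixels of the superpixel. The function name is hypothetical.

```python
import numpy as np

def node_potential(pixel_labels, num_labels):
    """Labeling costs for one superpixel from its pixelwise label histogram.

    Assumption: cost = -(normalized label count), i.e. negative label likelihood.
    """
    hist = np.bincount(pixel_labels, minlength=num_labels).astype(float)
    return -hist / hist.sum()

# A superpixel of 8 pixels, mostly covered by temporally smooth label 2.
pixel_labels = np.array([2, 2, 2, 0, 2, 1, 2, 2])
cost = node_potential(pixel_labels, num_labels=3)
cheapest_label = int(np.argmin(cost))
```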

3.3  Spatial Edge Potentials

Binary edge potential parameters ψ are of two types: spatial and temporal edge potentials. Spatial edge potentials model the pairwise relationship of two vertices i and j within a single video frame f. We define these pairwise potentials as follows:


A spatial edge potential parameter is the (l,l′) element of a matrix and represents the cost of labeling a pair of vertices i and j as l and l′, respectively. It takes the form of a Potts energy, in which all pairs of different labels incur the same cost. Spatial edge potentials are decomposed into four terms, which represent pairwise potentials in the channels of object boundary, color, optical flow direction and texture. The pairwise cost of having different labels is high if the two vertices i and j have high affinity in the corresponding channel. As a result, edge potentials increase the likelihood of assigning the same label to vertices i and j during energy minimization.

The edge potentials take equal weights on all channels. The importance of each channel may depend on video context, and different videos have dissimilar contexts. Learning the weights of each channel is challenging and prone to overfitting, due to the high variability of video context and the limited number of labeled video samples in the dataset. Hence, the proposed model weights all channels equally.

The model controls the granularity of segmentation by a threshold τ. In (9), the pairwise potential is thresholded by τ. If τ is set to a high value, only edges with higher affinity will be included in the graph. On the other hand, if we set a low value to τ, the number of edges increases and more vertices will be assigned to the same label because they are densely connected by the edge set. We next discuss each individual potential type in the context of our video segmentation model.
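The effect of τ can be sketched as follows. The defining equation is not reproduced in this text, so the sketch assumes that the per-channel affinities lie in [0, 1], that equal weighting means a plain average over the four channels, and that an edge is kept only when the averaged affinity exceeds τ. All names are hypothetical.

```python
import numpy as np

def spatial_edges(channel_affinities, tau):
    """Average per-channel affinities (C x n x n) and keep edges above tau."""
    mean_affinity = np.mean(channel_affinities, axis=0)   # equal channel weights
    keep = mean_affinity > tau
    np.fill_diagonal(keep, False)                         # no self-edges
    return np.where(keep, mean_affinity, 0.0)

rng = np.random.default_rng(1)
a = rng.uniform(0.0, 1.0, size=(4, 5, 5))                 # 4 channels, 5 vertices
a = (a + a.transpose(0, 2, 1)) / 2.0                      # symmetric affinities

n_edges_low = int(np.count_nonzero(spatial_edges(a, tau=0.3)))
n_edges_high = int(np.count_nonzero(spatial_edges(a, tau=0.7)))
n_edges_one = int(np.count_nonzero(spatial_edges(a, tau=1.0)))
```

Lowering τ can only add edges, which densifies the graph and pushes the inference towards coarser segmentations; τ = 1 removes every edge, so the labeling is driven by node potentials alone.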

Object Boundary Potentials. Object boundary potentials evaluate the cost of assigning two vertices i and j in the same frame to different labels in terms of object boundary information. The potential parameters are defined as follows:

Here, the potential depends on the minimum boundary path weight among all possible paths from vertex i to vertex j. The potentials are obtained by applying a Gaussian Radial Basis Function (RBF) to this path weight, with the mean path weight as a normalization term.

If the two superpixels i and j are adjacent, their object boundary potentials are determined by the strength of the object contour they share, that is, by the weight of the edge that connects vertices i and j, where the boundary strength is estimated by the contour detector [1]. The boundary potentials can be extended to non-adjacent vertices i and j by evaluating a path weight from vertex i to j. For each path p from vertex i to j, the boundary potential of p is evaluated by taking the maximum of the edge weights along p. The procedure for calculating the minimum path weight is described in Algorithm 1, which modifies the Floyd-Warshall shortest path algorithm.

Typically, a path in a graph is evaluated by the sum of the edge weights along the path. However, for the boundary strength between two non-adjacent vertices in the graph, the total sum of the edge weights along the path is not an effective measurement, because the sum is biased by the number of edges in the path. For example, a path consisting of many edges of weak contour strength may have a higher path weight than another path which consists of fewer edges with strong contours. Therefore, we evaluate a path by the maximum edge weight along the path, so that the path weight is governed by the edge with the strongest contour.
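The max-edge path weight can be computed with a small change to the Floyd-Warshall recurrence: the cost of going from i to j through k is the larger of the two sub-path weights rather than their sum. The sketch below is one way to write this, not a transcription of Algorithm 1; the function name and the toy graph are illustrative.

```python
import numpy as np

def minimax_path_weights(w):
    """All-pairs minimum over paths of the maximum edge weight along the path.

    w is a symmetric (n x n) matrix of contour strengths, np.inf for non-edges.
    """
    d = w.astype(float).copy()
    np.fill_diagonal(d, 0.0)
    for k in range(d.shape[0]):
        # A path i -> k -> j costs the larger of its two halves (max, not sum).
        via_k = np.maximum(d[:, k][:, None], d[k, :][None, :])
        d = np.minimum(d, via_k)
    return d

inf = np.inf
# Vertices 0 and 3 share a strong contour (0.6) but are also connected
# through vertices 1 and 2 by three weak contours (0.2, 0.3, 0.2).
w = np.array([[0.0, 0.2, inf, 0.6],
              [0.2, 0.0, 0.3, inf],
              [inf, 0.3, 0.0, 0.2],
              [0.6, inf, 0.2, 0.0]])
d = minimax_path_weights(w)
sum_weight_detour = 0.2 + 0.3 + 0.2    # what a sum-based path weight would charge
```

Under a sum-based weight the three-edge detour costs 0.7 and loses to the direct edge of 0.6, even though that edge crosses a strong contour. Under the max-based weight the detour costs only 0.3, so the two vertices are correctly treated as lying within the same object.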

Figure 2 illustrates the two path weight models, the max edge weight and the sum edge weight. Figure 2 (a) illustrates contour strength, where red represents high strength. Two vertices, indicated by white arrows, are selected on an airplane. In Figure 2 (b), two paths are displayed. Path 2 consists of fewer edges, but it intersects a strong contour that represents the boundary of the airplane. If we evaluate the object boundary score between the two vertices, Path 1 should be considered, since it connects the vertices within the airplane. Figure 2 (c) shows the edge sum path weight from a vertex at the tail to all other vertices. It shows that the minimum path weight between the two vertices is evaluated on Path 2. On the other hand, Figure 2 (d) illustrates that the max edge path weight takes Path 1 as the minimum path weight, which agrees with human perception of object hierarchy.


Figure 2: Comparison of two types of path weight models.

Color Potentials. The color feature of each vertex is represented by a histogram in CIELab color space over the corresponding superpixel. The color potential between vertices i and j is evaluated on their two color histograms:

Here, the potential depends on the Earth Mover’s Distance (EMD) between the color histograms of vertices i and j, together with a normalization parameter.

Earth Mover’s Distance [16] is a distance measurement between two probability distributions. For color histograms of superpixels, EMD is typically more accurate than the χ² distance. An issue with the χ² distance is that if two histograms on the simplex do not share any non-zero color bins, they are assigned the maximum distance of 1. Therefore, the distance between vertices i and j is the same as the distance between i and k if i, j and k do not share any color bins. This occurs often when comparing color features of superpixels, because a superpixel is intended to exhibit coherent color, especially at the fine-grained level, and superpixels on different objects or on different parts of an object may have different colors. For example, under the χ² distance, the distance between red and orange superpixels equals the distance between red and blue superpixels because they do not share color bins, which does not match human perception. In contrast, EMD considers the distance between color bins, so it is able to distinguish non-overlapping color histograms.
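The difference between a bin-to-bin distance and EMD can be seen on three single-bin histograms. The sketch assumes that the bin-to-bin distance is the χ² distance with a factor of 1/2, and it uses 1D histograms with unit spacing between adjacent bins, for which EMD reduces to the L1 distance between cumulative sums. The actual features are CIELab histograms, so this is a simplification, and the bin indices are made up for the example.

```python
import numpy as np

def chi2_distance(p, q):
    """Chi-square distance 0.5 * sum((p - q)^2 / (p + q)) over non-empty bins."""
    denom = p + q
    nz = denom > 0
    return float(0.5 * np.sum((p[nz] - q[nz]) ** 2 / denom[nz]))

def emd_1d(p, q):
    """EMD between 1D histograms of equal mass, unit distance between bins."""
    return float(np.sum(np.abs(np.cumsum(p - q))))

def one_hot(index, size=10):
    h = np.zeros(size)
    h[index] = 1.0
    return h

# Hypothetical hue-like axis: red and orange are neighbors, blue is far away.
red, orange, blue = one_hot(0), one_hot(1), one_hot(9)

chi2_near, chi2_far = chi2_distance(red, orange), chi2_distance(red, blue)
emd_near, emd_far = emd_1d(red, orange), emd_1d(red, blue)
```

The χ² distance saturates at 1 for both pairs, while EMD reports that orange is much closer to red than blue is.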

Optical Flow Direction Potentials. In each video frame, the motion direction feature of the i-th vertex is obtained from a histogram of optical flow directions. As with the color potentials, we use the EMD between the two histograms of vertices i and j to accurately estimate the difference in motion direction:

Here, the normalization parameter is the mean EMD between optical flow histograms.

Texture Potentials. Dense SIFT (D-SIFT) features are extracted for each superpixel, and a Bag-of-Words (BoW) model is obtained from K-means clustering of the D-SIFT features. We evaluate SIFT features on multiple dictionaries of different K. Texture potentials are calculated from an RBF on the χ² distance between the two BoW histograms of vertices i and j, which is a typical choice of distance measurement for BoW models:

Here, the normalization parameter is the mean distance between D-SIFT word histograms.

3.4  Temporal Edge Potentials

Temporal edge potentials define the correspondence of vertices in different frames. They rely on long trajectories, which convey long-range temporal dependencies and are more robust than optical flow.

Here, each vertex i is associated with the set of long trajectories that pass through it. The pairwise potential represents the temporal correspondence of two vertices i and j in different frames f≠f′, measured by the overlapping ratio of the long trajectories that the two vertices share. In order to distinguish two different objects with the same motion, we integrate color potentials between the two vertices. Long trajectories are extracted with [18].
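One plausible form of the overlapping ratio is sketched below. The exact definition is not reproduced in this text, so the sketch assumes intersection over union of the two trajectory sets, and it leaves out the color term. The function name and trajectory identifiers are hypothetical.

```python
def trajectory_overlap(traj_i, traj_j):
    """Assumed overlap ratio: shared trajectories over all trajectories of i and j."""
    union = traj_i | traj_j
    return len(traj_i & traj_j) / len(union) if union else 0.0

# Trajectory ids passing through a superpixel in frame f and one in frame f'.
traj_i = {1, 2, 3, 4}
traj_j = {3, 4, 5}
overlap = trajectory_overlap(traj_i, traj_j)
no_overlap = trajectory_overlap({1, 2}, {7, 8})
```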

3.5  Hierarchical Inference on Segmentation Labels

The proposed model attains hierarchical inference of segmentation labels by controlling the number of edges over a fixed set of vertices defined at the finest level of superpixels. As the edge set becomes denser, the energy function in (1) takes higher penalties from the pairwise potentials. As a consequence, vertices connected by dense edges will be assigned to the same label, which leads to coarse-grained segmentation.

In contrast, another approach that enables hierarchical segmentation is to define a hierarchical vertex set in a graph. A set of vertices in the finer level will be connected to a vertex in coarser level. It introduces another set of edges which connect vertices at different levels of hierarchy.

Our proposed approach to hierarchical inference has computational advantages over a graph representation with a hierarchical vertex set. Our graph representation has fewer vertices and edges, because we have a single finest level of hierarchy without additional vertices for coarser levels. This not only enables efficient graph inference but also saves computation time, since no node and edge potentials need to be calculated for additional vertex and edge sets.

4  Experimental Evaluation

4.1  Dataset

We evaluate the proposed model on the VSB100 video segmentation benchmark provided by Galasso et al. [9]. There are a few additional video datasets that have pixelwise annotation: the FBMS-59 dataset [15] consists of 59 video sequences and the SegTrack v2 dataset [13] consists of 14 sequences. However, both datasets annotate only a few major objects, leaving the whole background area as one label, which makes them more appropriate for object tracking or background subtraction tasks. On the other hand, VSB100 consists of 60 test video sequences of at most 121 frames. For each video, every 20th frame is annotated with pixelwise segmentation labels by four annotators. The dataset contains the largest number of video sequences annotated with pixelwise labels, which allows quantitative analysis. The dataset provides a set of evaluation measurements.

Volume Precision-Recall. VPR score measures overlap of the volume between the segmentation result of the proposed algorithm S and ground truths annotated by M annotators. Over-segmentation will have high precision with low recall score.

Boundary Precision-Recall. BPR score measures overlap between the object boundaries of the segmentation result S and the ground truth boundaries. Conversely to VPR, over-segmentation will have low precision with high recall scores.

4.2  MSP-MRF Setup

In this section, we present the detailed setup of our Multi-Cue Structure Preserving Markov Random Field (MSP-MRF) model for the unconstrained video segmentation problem. As described in Section 3.2, we take a single threshold on image contours, so that each frame contains approximately 100 superpixels. We assume that this granularity level is fine enough that no superpixel at this level overlaps multiple ground truth regions. Node potential (6) is evaluated for each superpixel with the temporally smooth labels obtained with agglomerative clustering [10]. Although we chose the 11th fine-grained level of the hierarchy, Section 4.4 illustrates that the proposed method shows stable performance over different label set sizes |L| for the node potential. Finally, edge potentials are estimated as in (9) and (14). For color histograms, we used 50 bins for each CIELab color channel. In addition, 50 bins were set for the horizontal and vertical motion of optical flow. For the D-SIFT Bag-of-Words model, we used 5 dictionaries of K=100,200,400,800,1000 words. The energy minimization problem in (1) for MRF inference is optimized using the FastPD algorithm [12].



Figure 3: Temporal consistency recovered by MSP-MRF.


Figure 4: Comparison of segmentation boundary on the same granularity levels on two videos.

4.3  Qualitative Analysis

Figure 3 illustrates a segmentation result on an airplane video sequence. MSP-MRF rectifies the temporally inconsistent segmentation result of [10]. For example, in the fourth column of Figure 3, the red bounding boxes show where MSP-MRF rectified labels from Grundmann’s result such that labels across frames become spatio-temporally consistent.

In addition, the control parameter τ successfully produces different granularity levels of segmentation. For MSP-MRF, the number of region labels decreases as τ decreases. Figure 4 compares video segmentation results of MSP-MRF with Grundmann’s by displaying segmentation boundaries at the same granularity levels, where the two methods have the same number of segments in the video. MSP-MRF infers spatially smooth object regions, which illustrates that the proposed model successfully captures the spatial structure of objects.


Figure 5: PR curve comparison to other models.


Figure 6: PR curve on different size of label set L.

Table 1: Performance of MSP-MRF model compared with state-of-the-art video segmentation algorithms on VSB100.

4.4  PR Curve on High Recall Regions

We specifically consider high recall regions of segmentation since we are typically interested in videos with relatively few objects. Our proposed method improves and rectifies the state-of-the-art video segmentation of greedy agglomerative clustering [10], because we make use of structural information of object boundary, color, optical flow, texture and temporal correspondence from long trajectories. Figure 5 shows that the proposed method achieves significant improvement over state-of-the-art algorithms. MSP-MRF improves both BPR and VPR scores such that it is close to Oracle, which evaluates contour based superpixels on ground truth. It is worth noting that Oracle is the best accuracy that MSP-MRF could possibly achieve, because MSP-MRF also takes contour based superpixels from [1].

The proposed MSP-MRF model rectifies agglomerative clustering by merging two different labels of vertices if it reduces overall cost defined in (1). By increasing the number of edges in the graph by lowering threshold value, the model leads to coarser grained segmentation. As a result, MSP-MRF only covers higher recall regions from precision-recall scores of the selected label set size |L| from [10]. A hybrid model that covers high precision regions is described in Section 4.5.

Figure 6 illustrates the PR curve of MSP-MRF on different granularity levels of label set |L| in node potential (6). Dashed-green line is the result of greedy agglomerative clustering [10]. Solid-green line is the result of MSP-MRF with edge threshold τ set to 1, which leaves no edge in the graph. The figure shows that results of MSP-MRF are stable over different size of |L|, particularly in the high recall regions.

4.5  Hybrid Model for Over Segmentation

The proposed model effectively merges labels of each pair of nodes according to edge set E. As the number of edges increases, the size of the inferred label set will decrease from |L|, which will cover higher recall regions. Although we are interested in high recall regions, the model needs to be evaluated on high precision regions of PR curve. For this purpose, we take a hybrid model that obtains rectified segmentation results from MSP-MRF on the high recall regions but retains segmentation result of [10] on high precision regions as an unrectified baseline.

Table 1 compares performance with state-of-the-art video segmentation algorithms. The proposed MSP-MRF model outperforms state-of-the-art algorithms on most of the evaluation metrics. BPR and VPR are described in Section 4.1. Optimal dataset scale (ODS) aggregates F-scores at a single fixed scale of the PR curve across all video sequences, while optimal segmentation scale (OSS) selects the best F-score at a different scale for each video sequence. All the evaluation metrics follow the benchmark dataset [9]. It is worth noting that our MSP-MRF model achieves the best ODS and OSS results for both BPR and VPR, equivalent to the results of Oracle. As described in Section 4.4, Oracle is a model that evaluates contour-based superpixels on ground truth.
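The difference between ODS and OSS can be illustrated with a minimal sketch. It assumes per-video precision and recall have already been computed at a common set of scales, and it averages per-video F-scores, which is a simplification: the benchmark in [9] may aggregate differently. The function names `f_score` and `ods_oss` are hypothetical.

```python
import numpy as np

def f_score(p, r):
    """Harmonic mean of precision and recall (0 where both are 0)."""
    p, r = np.asarray(p, float), np.asarray(r, float)
    denom = p + r
    safe = np.where(denom > 0, denom, 1.0)  # avoid division by zero
    return np.where(denom > 0, 2 * p * r / safe, 0.0)

def ods_oss(precision, recall):
    """precision, recall: arrays of shape (n_videos, n_scales)."""
    f = f_score(precision, recall)
    # ODS: a single scale shared by all videos (assumption: mean F per scale).
    ods = f.mean(axis=0).max()
    # OSS: every video picks its own best scale, then the bests are averaged.
    oss = f.max(axis=1).mean()
    return float(ods), float(oss)

if __name__ == "__main__":
    P = np.array([[1.0, 0.5], [0.5, 1.0]])
    R = np.array([[1.0, 0.5], [0.5, 1.0]])
    print(ods_oss(P, R))
```

Under this simplification OSS can never be lower than ODS, because each video is free to choose its best scale.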

MSP-MRF infers segmentation labels by integrating object boundary, global structure and temporal smoothness on top of [10]. The results show that incorporating boundary and global structure rectifies [10] by a significant margin. It should be noted that the result of [10] is higher than previously reported in [9]. We assume this is due to implementation updates to [10] over recent years. Qualitatively, we observe that the recent implementation of [10] detects objects whose appearance is less distinctive from the background, which the previous implementation could not detect.

5  Conclusion

In this paper, we have presented a novel video segmentation model that considers three important aspects of video segmentation. The model preserves object boundaries by defining the vertex set from contour-based superpixels. In addition, temporally smooth labels are inferred by providing the unary node potential from the agglomerative clustering label likelihood. Finally, global structure is enforced through pairwise edge potentials on object boundary, color, optical flow motion, texture and long trajectory affinities. Experimental evaluation shows that the proposed model outperforms state-of-the-art video segmentation algorithms on most of the metrics.



[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(5):898–916, May 2011.

[2] V. Badrinarayanan, F. Galasso, and R. Cipolla. Label propagation in video sequences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[3] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2008.

[4] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(3):500–513, 2011.

[5] A. Elqursh and A. M. Elgammal. Online motion segmentation using dynamic label propagation. In IEEE International Conference on Computer Vision (ICCV), pages 2008–2015, 2013.

[6] B. Fröhlich, E. Rodner, M. Kemmler, and J. Denzler. Large-scale gaussian process multi-class classification for semantic segmentation and facade recognition. Machine Vision and Applications, 24(5):1043–1053, 2013.

[7] F. Galasso, R. Cipolla, and B. Schiele. Video segmentation with superpixels. In Asian Conference on Computer Vision (ACCV), 2012.

[8] F. Galasso, M. Keuper, T. Brox, and B. Schiele. Spectral graph reduction for efficient image and streaming video segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[9] F. Galasso, N. S. Nagaraja, T. J. Cardenas, T. Brox, and B. Schiele. A unified video segmentation benchmark: Annotation, metrics and analysis. In IEEE International Conference on Computer Vision (ICCV), December 2013.

[10] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph based video segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[11] A. Khoreva, F. Galasso, M. Hein, and B. Schiele. Learning must-link constraints for video segmentation based on spectral clustering. In German Conference on Pattern Recognition (GCPR), 2014.

[12] N. Komodakis and G. Tziritas. Approximate labeling via graph cuts based on linear programming. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(8):1436–1453, Aug. 2007.

[13] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In IEEE International Conference on Computer Vision (ICCV), 2013.

[14] B. Nadler and M. Galun. Fundamental limitations of spectral clustering methods. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2007. MIT Press.

[15] P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(6):1187–1200, Jun. 2014.

[16] O. Pele and M. Werman. Fast and robust earth mover’s distances. In IEEE International Conference on Computer Vision (ICCV), 2009.

[17] P. Ochs and T. Brox. Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. In IEEE International Conference on Computer Vision (ICCV), 2011.

[18] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science. Springer, Sept. 2010.

[19] C. Zhang, L. Wang, and R. Yang. Semantic segmentation of urban scenes using dense depth maps. In European Conference on Computer Vision (ECCV), pages 708–721, Berlin, Heidelberg, 2010. Springer-Verlag.

Sea level prediction using Gaussian Process Models

Sea level prediction is a complicated spatio-temporal regression problem that has drawn a lot of attention recently. Understanding sea level behavior can help us learn more about climate change and its consequences. However, predicting sea level is not easy, since we have to deal with problems such as noisy observations and censored data. In this project, we focus on Gaussian processes (GPs) for modelling sea level because of their flexibility and effectiveness.

In the first stage, we are working on how to predict sea level using uncertain inputs with ordering constraints. One of the inputs for predicting sea level is the age of each record; however, due to the limitations of the C14 dating technique, we cannot obtain the true ages of the records, only a noisy version of them. In addition, the true inputs must be in decreasing order. Utilizing this information, we propose a fast and accurate method to estimate the true inputs and the hyper-parameters of the GP models.

String Kernel Methods for DNA sequence analysis

The research I have been working on is String Kernel Methods for DNA sequence analysis, under the supervision of Prof. Pavlovic. I focus on the problem of species-level identification based on short DNA fragments known as barcodes. Kernel methods approach classification by mapping the original data into a set of points in a feature space that potentially makes it easier to detect complex relationships in the data. This, in turn, leads to learning algorithms that can exhibit higher classification accuracy and robustness.

Financial Time Series Analysis

Problem Definition

Financial decisions with respect to investing in industry based indices are often based on heuristics and non-standard methods or purely based on the company specific algorithmic methods. In-depth analysis of the historic stock market behavior and dynamics among different industries are critical for predicting the future trading outcomes. It is also important to identify which companies’ stock prices are leading or lagging and perform in a similar trend to other companies, so that we can identify groups of companies which behave similarly in a certain time period for better investment decision making.

Our Approaches

We are working on methods to enhance the applicability of time series analysis to historic stock market data, in order to identify specific groupings of companies with similar patterns/behavior and to verify their agreement with GICS (Global Industry Classification Standard) sectors. The GICS sectors are defined as Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunication Services and Utilities. We have worked on a variety of machine learning and probabilistic methods, as described below.
  • Finding similarities between time series sequences using string kernel matching

The time series of historic stock prices are converted, using a sliding window, into string sequences over a finite alphabet. We then use the method proposed by Kuksa et al. [1] to measure the similarity between string sequences with mismatch kernels. We only use a local mismatch kernel, which measures the similarity between pairs of strings within a specified time lag (for example, within 2 weeks), unlike the global mismatch kernel, which measures the similarity between all possible pairs of strings. The reason is that in the financial domain the impact of one time series on another is short-term: since the stock market is efficient, the longer-term impact would be minimal. We then cluster with the affinity propagation algorithm to find companies/tickers with similar patterns and check how well the clusters agree with the GICS classification standard. One set of results we have obtained is shown in the following table.
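As a rough illustration of the string representation and the mismatch kernel, the sketch below symbolizes a price series by the sign of its relative changes and evaluates a brute-force (k, m)-mismatch kernel. The three-letter alphabet, the threshold `eps`, and the function names are assumptions made for this example; the scalable algorithm of [1] and the time-lag restriction of the local kernel are not reproduced here.

```python
from itertools import product

def symbolize(prices, eps=0.005):
    """Map consecutive relative price changes to 'u' (up), 'd' (down) or 'f' (flat)."""
    out = []
    for prev, cur in zip(prices, prices[1:]):
        change = (cur - prev) / prev
        out.append('u' if change > eps else 'd' if change < -eps else 'f')
    return ''.join(out)

def mismatch_features(s, k, m, alphabet='udf'):
    """For every k-mer a over the alphabet, count substrings of s within m mismatches of a."""
    feats = {}
    for i in range(len(s) - k + 1):
        sub = s[i:i + k]
        for a in product(alphabet, repeat=k):
            if sum(c1 != c2 for c1, c2 in zip(sub, a)) <= m:
                feats[a] = feats.get(a, 0) + 1
    return feats

def mismatch_kernel(s, t, k=3, m=1):
    """Inner product of the two mismatch feature maps."""
    fs, ft = mismatch_features(s, k, m), mismatch_features(t, k, m)
    return sum(v * ft.get(a, 0) for a, v in fs.items())

if __name__ == "__main__":
    a = symbolize([100, 101, 102, 101, 101, 103])
    b = symbolize([50, 50.6, 51.3, 50.9, 50.9, 52.0])
    print(a, b, mismatch_kernel(a, b))
```

The brute-force enumeration costs on the order of |alphabet|^k per substring, so it is only practical for small k and small alphabets.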

  • Granger Causality based analysis

Granger causality is a statistical technique, introduced by Nobel laureate Clive Granger, for testing whether one time series has a causal relationship with another. We used this statistical hypothesis test to examine how the historic time series of one company/ticker is affected by the lagged time series of other companies/tickers. We performed this analysis for companies/tickers within and between different industry sectors (as defined by the GICS classification) with different thresholds of statistical significance and analyzed how the resulting causal graphs vary over time. This type of analysis led to the idea of looking at time-varying graphs [2], which identify how the causal relationships between the companies/tickers change over time. The following graph shows the number of Granger causal links within each industry sector for a specific time period.

  • Sparse Regression based analysis
We have also tried Lasso regression to model the linear relationship between a single ticker/company’s time series and other tickers’/companies’ time series. This analysis was important for finding a sparse representation of the relationship between tickers within the same sector and between different sectors. The following bar graph shows the distribution of within-sector and between-sector links after using Lasso regression with a penalty parameter chosen on a validation set of time sequences.
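A minimal NumPy sketch of the sparse regression idea follows. It fits a Lasso by plain coordinate descent on synthetic data; the objective scaling, the solver, and the name `lasso_cd` are choices made for this illustration and are not taken from the project.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink x toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0:
                continue
            # Residual with the contribution of feature j added back.
            resid = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ resid / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    others = rng.normal(size=(200, 5))   # stand-in for other tickers' series
    target = 2.0 * others[:, 0] + 0.01 * rng.normal(size=200)
    print(lasso_cd(others, target, alpha=0.1))
```

Non-zero coefficients would be read as links from the corresponding tickers to the target ticker; increasing `alpha` yields a sparser set of links.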

  • Random Graph based analysis
We are also interested in modelling the relationship between different time series using random graphs. We have run some experiments with Exponential Random Graph Models (ERGMs), modelling these financial time series with a set of network parameters such as density, number of mutual edges, and number of triangles, which indirectly control the structure of the graphs and how sparse or dense they are. This type of analysis can also be used to model how the graphs and their parameters change over time, resulting in dynamic graph analysis methods that could discover important links between companies and how they change over time.
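The network statistics mentioned above can be computed directly from an adjacency matrix. The sketch below assumes a binary directed graph without self-loops and counts triangles on the undirected skeleton; fitting an actual ERGM is outside its scope, and `graph_statistics` is a hypothetical helper name.

```python
import numpy as np

def graph_statistics(A):
    """A: binary adjacency matrix of a directed graph without self-loops."""
    A = np.asarray(A, dtype=int)
    n = A.shape[0]
    density = A.sum() / (n * (n - 1))       # fraction of possible directed edges
    mutual = int((A * A.T).sum() // 2)      # node pairs linked in both directions
    U = ((A + A.T) > 0).astype(int)         # undirected skeleton
    triangles = int(np.trace(U @ U @ U) // 6)
    return density, mutual, triangles

if __name__ == "__main__":
    # Edges: 0->1, 1->0, 1->2, 2->0
    A = [[0, 1, 0],
         [1, 0, 1],
         [1, 0, 0]]
    print(graph_statistics(A))
```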

[1] Kuksa,Huang & Pavlovic, Scalable Algorithms for String Kernels with Inexact Matching, Neural Information Processing Systems 2008 (NIPS 2008) 
[2] Kolar, Ahmed, Xing, Estimating Time Varying Networks, 2010, The Annals of Applied Statistics, Vol. 4, No. 1, 94–123


Conditional Ordinal Random Fields

Conditional Random Fields (CRFs) and Hidden Conditional Random Fields (HCRFs) are a staple of many sequence tagging and classification frameworks. An underlying assumption in those models is that the state sequences (tags), observed or latent, take their values from a set of nominal categories. These nominal categories typically indicate tag classes (e.g., part-of-speech tags) or clusters of similar measurements. However, in some sequence modeling settings it is more reasonable to assume that the tags indicate ordinal categories or ranks. Dynamic envelopes of sequences such as emotions or movements often exhibit intensities growing from neutral, through raising, to peak values.

In this project we develop models and algorithms for sequences of ranks or ordinal categories. Our first model, CORF (Conditional Ordinal Random Field) [1], is to ordinal data what CRF is to nominal data. HCORF (Hidden Conditional Ordinal Random Field) [2] generalizes this idea to latent settings, where we cannot observe ordinal ranks but still want to model dynamics in this space.

We have applied these models to analysis of facial emotions and facial emotion intensities, as well as classification of human activities from video sequences. 

Software: code.


  • [1] M. Kim and V. Pavlovic. “Structured output ordinal regression for dynamic facial emotion intensity prediction”. Computer Vision – ECCV 2010. K. Daniilidis, P. Maragos, and N. Paragios, eds. 2010. pp. 649-662.
  • [2] M. Kim and V. Pavlovic. “Hidden Conditional Ordinal Random Fields for Sequence Classification”. ECML/PKDD. 2010. pp. 51-65.

Sparse Granger Causality Graphs for Human Action Classification

1. Overview

Modeling and classification of human actions are important problems that have received significant attention in pattern recognition.  Mocap data is widely available and can serve as a good proxy for assessing action models before they are applied to video data. In this paper, we present a human action classification framework that extends the video analysis using Granger causality graphs to represent densely sampled human actions embodied in mocap data. We accomplish this by defining sparse events detected in movements of human body parts. The events are taken as nodes of a graph and edge weights are calculated from Granger causality between pairs of events. The graph describes human actions in terms of causal relationship among body parts movements.

Fig 1. Framework overview. Walking action is shown on the left column and jumping on the right. Top row depicts example of mocap sequences \(\bf{d_k}\). Two point processes on the events of right leg \(N_{k1}\) and left leg \(N_{k4}\) are shown for each action. Different temporal patterns are observed for different actions. From the point processes, Granger causality graph \(G_k\)  is constructed to represent its motion by causal relations between events. For walking sequence, the event in left leg causes right leg(\(G_{N_{k1}\to N_{k4}}\)). But the same causal relationship is not observed for jumping. Finally, a model that classifies causal graphs is learned for each action class.

2. Prior work

Granger causality is a statistical test to detect a relationship between two time series [1]. When predicting a time series \(X\), another time series \(Y\) is said to cause \(X\) if adding \(Y\) helps the prediction of \(X\). Given two autoregressive (AR) models of \(X\)

\[ X_t=\sum_{j=1}^{\infty}a_{1j}X_{t-j}+\epsilon_{1t},\>\>\epsilon_{1t} \sim \mathcal{N}(0, \Sigma_1), \tag{1} \]

\[ X_t=\sum_{j=1}^{\infty}a_{2j}X_{t-j}+\sum_{j=1}^{\infty}b_{2j}Y_{t-j}+\epsilon_{2t}, \>\>\epsilon_{2t} \sim \mathcal{N}(0, \Sigma_2), \tag{2} \]

the causal power \(G_{Y \to X}\) is high if adding \(Y\) reduces prediction error of \(X\). Thus, Granger causality is defined as: \(G_{Y \to X}=\ln(\Sigma_1/\Sigma_2)\).
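The definition \(G_{Y \to X}=\ln(\Sigma_1/\Sigma_2)\) can be tried out with a small least-squares sketch. It truncates the infinite sums in Eqs. (1) and (2) to a finite order `p` and omits an intercept, both of which are simplifying assumptions; the function names are hypothetical.

```python
import numpy as np

def lagged(z, p):
    """Rows [z[t-1], ..., z[t-p]] for t = p, ..., len(z) - 1."""
    return np.column_stack([z[p - j:len(z) - j] for j in range(1, p + 1)])

def residual_variance(target, design):
    """Variance of the least-squares residual of target regressed on design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ coef)

def granger_causality(x, y, p=2):
    """G_{Y -> X} = ln(Sigma_1 / Sigma_2) with AR models of order p."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    target = x[p:]
    sigma1 = residual_variance(target, lagged(x, p))              # Eq. (1): own past only
    sigma2 = residual_variance(
        target, np.hstack([lagged(x, p), lagged(y, p)]))          # Eq. (2): plus past of Y
    return float(np.log(sigma1 / sigma2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(size=500)
    x = np.zeros(500)
    x[1:] = 0.8 * y[:-1] + 0.1 * rng.normal(size=499)   # x is driven by the past of y
    print(granger_causality(x, y), granger_causality(y, x))
```

In this synthetic example the first value should be large and the second close to zero, reflecting the asymmetry of the measure.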

Non-parametric pairwise Granger causality is calculated as follows: given two point processes \(n_X\) and \(n_Y\), a power spectral matrix \(S_{XY}\) is defined as the Fourier transform of covariance of two point processes \(n_{X}, n_{Y},\) which is estimated using the multitaper function \(h_k(t_j)\) [2]:

\[ S_{XY}(f)=\frac{1}{2\pi KT}\sum_{k=1}^{K}\widetilde{n_X}(f,k)\widetilde{n_Y}(f,k)^*,\tag{3} \]
\[ \widetilde{n_i}(f,k)=\sum_{j}h_k(t_j)\exp(-i2 \pi f t_j),\tag{4}\]

and \(S_{XY}\) is factorized by Wilson’s algorithm as follows:

\[ S_{XY}(f)=H_{XY}(f) \Sigma_{XY} H^{*}_{XY}(f),\tag{5}\]

where \(H_{XY}\) is the transfer function, which corresponds to the coefficients of the AR model, \(\Sigma\) corresponds to the covariance matrix of the error term of the AR model, and \(\ast\) denotes the conjugate transpose.
The nonparametric pairwise Granger causality \(G_{n_Y \to n_X}\) at frequency \(f\) is finally calculated as:

\[ G_{n_{Y} \to n_{X}}(f)=\ln\frac{S_{XX}(f)}{S_{XX}(f)-(\Sigma_{YY}-\frac{\Sigma_{XY}^2}{\Sigma_{XX}}) |H_{XY}(f)|^2},\tag{6} \]

We will use this notion of causality in analyzing the mocap data, where many motions exhibit natural (quasi) periodic behavior.
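Eqs. (3) and (4) can be evaluated directly on lists of event times. The sketch below does so with sine tapers, which are an assumption standing in for the multitaper functions \(h_k\) of [2]; it also skips mean removal and the spectral factorization of Eq. (5). All names are hypothetical.

```python
import numpy as np

def sine_tapers(times, K, T):
    """K orthonormal sine tapers on [0, T], evaluated at the event times (assumed taper family)."""
    k = np.arange(1, K + 1)[:, None]
    return np.sqrt(2.0 / T) * np.sin(np.pi * k * times[None, :] / T)

def tapered_transform(times, freqs, K, T):
    """Eq. (4): n~(f, k) = sum_j h_k(t_j) exp(-i 2 pi f t_j); result has shape (F, K)."""
    times = np.asarray(times, float)
    freqs = np.asarray(freqs, float)
    h = sine_tapers(times, K, T)                                      # (K, n_events)
    phase = np.exp(-2j * np.pi * freqs[:, None] * times[None, :])     # (F, n_events)
    return phase @ h.T

def cross_spectrum(times_x, times_y, freqs, K=5, T=1.0):
    """Eq. (3): S_XY(f) = 1 / (2 pi K T) * sum_k n~X(f, k) conj(n~Y(f, k))."""
    nx = tapered_transform(times_x, freqs, K, T)
    ny = tapered_transform(times_y, freqs, K, T)
    return (nx * np.conj(ny)).sum(axis=1) / (2 * np.pi * K * T)

if __name__ == "__main__":
    tx = [0.1, 0.3, 0.5, 0.7, 0.9]
    ty = [0.15, 0.35, 0.55, 0.75, 0.95]
    freqs = np.linspace(0.0, 10.0, 8)
    print(np.abs(cross_spectrum(tx, ty, freqs, K=3, T=1.0)))
```

The auto-spectrum obtained with `times_x == times_y` is real and non-negative, and swapping the two processes conjugates the cross-spectrum, which are the properties the factorization in Eq. (5) relies on.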

3. Sparse Granger Causality Graph Model


Algorithm 1. Sparse Granger Causality Graph Model

Input: mocap dataset \( D = \{(\mathbf{d}_1, y_1), \ldots, (\mathbf{d}_n, y_n)\}\)

1. Generate a set of point processes \(N_k\) from \(\mathbf{d}_k\)

\({\bf for } \textrm{joint } i\)

\(N_{ki} \leftarrow \{1(t)|d_{ki}^t = \textrm{peak or valley}\}\)

\({\bf end for}\)

2. Estimate Granger causality graph \(G_k\) from \(N_k\)

\({\bf for}\) joint pair \(X,Y\)

Estimate spectrum \(S_{XY}\) from Eq. (3)

Factorize \(S_{XY}\) from Eq. (5)

Estimate Granger causality \(G_{n_Y \to n_X}\) from Eq. (6)

\({\bf end for}\)

3. Learn a sparse model

\({\bf for}\) action class \(c_i\)

Learn L1 regularized log. reg. model from \(\{G_k|y_k = c_i\}\)

\({\bf end for}\)

Our approach to building the sparse causality graph models of human actions is summarized in Algorithm 1.  The approach, denoted by SGCGM, has three major steps:

1. From each mocap sequence, events are detected and point processes are generated on the events. A mocap sequence consists of multiple time series densely recorded for each joint angle. From each dense time series of a joint angle, two different events of peak and valley are extracted through the extreme point detector. As a result, we define \(M\) events over all joints, and \(M\) point processes on events construct a set of point processes \(N_k\) for a mocap sequence \(\bf{d_k}\). We assume that representing joint angle trajectories with extreme points conveys enough information to construct causal structure among joints.

2. From a set of point processes \(N_k\) for a mocap sequence \(\bf{d_k}\), a Granger causality graph \(G_k\) is constructed. For each pair of point processes \(N_k\), a power spectrum \(S\) is estimated by the multitaper method.
Pairwise non-parametric Granger causality is calculated over \(F\) frequencies from Eq. (6). As a result, a Granger causality graph is represented by \(F\) adjacency matrices of size \(M \times M\), one for each frequency band. Each node represents an event in the point process \(n_X\), and the pairwise causal power \(G_{n_Y \to n_X}\) is used as the weight of the directed edge between nodes \(X\) and \(Y\).

3. After each mocap sequence is converted into a causal graph, we learn a model that classifies the causal graph into one of the action classes. In order to capture the sparse structure of the graph displayed across samples of each class, we use an L1 regularized logistic regression model. To represent the graph, we take the Granger causality of all edges and frequencies as features.
A sparse regression model learns regression coefficients between the input features and the class label. A classification model is learned for each action class from the training data, and each test sample is assigned to the action whose logistic function shows the highest confidence.
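Step 1 of the approach above can be sketched as follows. The detector marks strict local maxima and minima of each joint angle series, which is one simple reading of the "extreme point detector"; the actual detector may smooth or threshold the signal first. Function names are hypothetical.

```python
import numpy as np

def extreme_events(series):
    """Return (peak_idx, valley_idx): strict local maxima and minima of a 1-D series."""
    s = np.asarray(series, float)
    left, mid, right = s[:-2], s[1:-1], s[2:]
    peaks = np.where((mid > left) & (mid > right))[0] + 1
    valleys = np.where((mid < left) & (mid < right))[0] + 1
    return peaks, valleys

def point_processes(joint_angles):
    """joint_angles: (n_frames, n_series). Returns a binary array (n_frames, 2 * n_series)."""
    joint_angles = np.asarray(joint_angles, float)
    n_frames, n_series = joint_angles.shape
    N = np.zeros((n_frames, 2 * n_series), dtype=int)
    for i in range(n_series):
        peaks, valleys = extreme_events(joint_angles[:, i])
        N[peaks, 2 * i] = 1          # peak events of series i
        N[valleys, 2 * i + 1] = 1    # valley events of series i
    return N

if __name__ == "__main__":
    t = np.linspace(0, 4 * np.pi, 200)
    angles = np.column_stack([np.sin(t), np.cos(2 * t)])
    print(point_processes(angles).sum(axis=0))   # number of events per point process
```

Each input series yields two point processes, one for peaks and one for valleys, so the number of processes is twice the number of joint angle series.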

4. Experimental Evaluation

We performed experiments on the HDM05 dataset [3]. HDM05 contains mocap data in the form of 29 skeletal joints, each with 2-3 rotation angles, resulting in 62 joint angle time series. From each time series, two events of peak and valley are detected. As a result, the total number of point processes M is set to 124. Also, we set the number of frequencies F in a power spectrum to 128. After computing the Granger causality graph for all 128 frequencies, we summarize them into 4 bands of high, mid-high, mid-low and low frequency by applying a Hanning window in log scale.


SGCGM depends on the model learned for each class, which requires sufficient number of samples. HDM05 dataset is well-suited for our requirement with more than 100 classes and multiple trials performed by 5 subjects. We select 8 action classes that have a sufficient number of samples across subjects. The chosen classes are listed in Table 2.


We perform 5-fold cross validation in two different settings. In the first one, the data is randomly sampled across subjects, so that both test and train data contain samples of motions performed by the same person. In the second setting, we split the data so that data from the test subject is not used during training. This is typically a more challenging setting.
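The two evaluation settings differ only in how sample indices are grouped into folds. A minimal sketch, with hypothetical helper names and a toy list of subject identifiers:

```python
import random

def random_folds(n_samples, n_folds=5, seed=0):
    """Setting 1: shuffle all samples, ignoring which subject performed them."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def subject_folds(subjects):
    """Setting 2: one fold per subject, so the test subject is never seen in training."""
    folds = {}
    for i, s in enumerate(subjects):
        folds.setdefault(s, []).append(i)
    return [folds[s] for s in sorted(folds)]

def train_test_pairs(folds):
    """Yield (train_indices, test_indices) for each held-out fold."""
    all_idx = sorted(i for fold in folds for i in fold)
    for fold in folds:
        held = set(fold)
        yield [i for i in all_idx if i not in held], sorted(fold)

if __name__ == "__main__":
    subjects = ['a', 'b', 'a', 'c', 'b']   # toy subject identifier per sample
    for train, test in train_test_pairs(subject_folds(subjects)):
        print(train, test)
```

With five subjects, the subject-wise split produces five folds, matching the 5-fold protocol.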

Events detected from CMU motion capture data

  • Events of left and right knee extracted from a walking sequence

  • Events of left and right knee extracted from a jumping sequence

Granger causal graph

Fig 2. A Granger causality graph of the class DepositFloorR. Edges having the top 5% of the weights are drawn. Edges among femurs, tibias and feet describe bending legs. Directed edges from right hand and thumb to right and left tibia represent deposit motion with right hand.

5. Results

Table 1. Classification Performance on HDM05
Cut1 79.31±8.1 69.5±8.49 62.8±6.8 74.6±9.9 87.4±4.7
Cut2 45.9±13.0 51.8±10.5 50.9±12.0 57.5±11.4 69.3±9.5

Table 2: Confusion matrix of SGCGM result for Cut1


  • [1] C. W. J. Granger. “Investigating causal relations by econometric models and cross-spectral methods”, Econometrica, 37(3):424–438, 1969.
  • [2] A. Nedungadi, G. Rangarajan, N. Jain, and M. Ding. “Analyzing multiple spike trains with nonparametric granger causality”, Journal of Computational Neuroscience, 27:55–64, 2009.
  • [3] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, A. Weber “Documentation Mocap Database HDM05”, Technical report, No. CG-2007-2, ISSN 1610-8892, Universität Bonn, June 2007.


Full text: [pdf]

Presentation slide: [pptx]

Spatio-Temporal Context Modeling for BoW-Based Video Classification

1. Abstract
We propose an autocorrelation Cox process that extends the traditional bag-of-words representation to model the spatio-temporal context within a video sequence. Bag-of-words models are effective tools for representing a video by a histogram of visual words that describe local appearance and motion. A major limitation of this model is its inability to encode the spatio-temporal structure of visual words pertaining to the context of the video. Several works have proposed to remedy this by learning the pairwise correlations between words. However, pairwise analysis leads to a quadratic increase in the number of features, making the models prone to overfitting and challenging to learn from data. The proposed autocorrelation Cox process model encodes, in a compact way, the contextual information within a video sequence, leading to improved classification performance. Spatio-temporal autocorrelations of visual words estimated from the Cox process are coupled with the information gain feature selection to discern the essential structure for the classification task. Experiments on crowd activity and traffic density dataset illustrate that the proposed model achieves state-of-the-art performance while providing intuitive spatio-temporal descriptors of the video context.

2. Spatio-temporal context model

  • Univariate Cox process

A Cox process \(X\) is a point process defined on a locally finite subset \(S \subset \mathbb{R}^2\). The intensity \(\Lambda\) of the Cox process \(X\) is itself a stochastic process. If the intensity \(\Lambda\) is spatially constant, the Cox process follows a homogeneous Poisson process. [1] proposed the Log Gaussian Cox Process (LGCP) to model spatial point processes. The intensity process \(\Lambda\) of the LGCP follows a log Gaussian process:

\[ \begin{aligned} \Lambda &= \{\Lambda(s) : s \in \mathbb{R}^2\}, \\ \Lambda(s) &= \exp\{ Y(s)\}, \\ Y &\sim \mathcal{N}(\mu,\sigma^2) \end{aligned} \]

Summary statistics of the Log Gaussian Cox process \(X\) with intensity \(\Lambda\) are defined by the first and second order moments. [1] suggests efficient non-parametric estimators of the mean intensity \(\rho\) and the pair correlation function \(g(r)\) for a univariate Cox process \(X\):

\[ \begin{aligned}
\hat{\rho} &= n/A(S),\\
\hat{g}(r) &= \frac{1}{2\pi r \hat{\rho}^2 A(S)}\sum_i \sum_{j \ne i}k_h(r - \|x_i - x_j\|_2)\,b_{ij},\\
k_h(a) &= \frac{3}{4h}(1-a^2/h^2)\,\delta(a),\\
\delta(x) &= \begin{cases} 1 & \text{if } -h \le x \le h\\ 0 & \text{otherwise} \end{cases}
\end{aligned} \]
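The estimators above translate almost line by line into NumPy. In the sketch below the weights \(b_{ij}\) are all set to 1, which is an assumption made here because their definition is not given in this summary; the function names are hypothetical.

```python
import numpy as np

def epanechnikov(a, h):
    """k_h(a) = 3/(4h) * (1 - a^2/h^2) on [-h, h], zero elsewhere."""
    a = np.asarray(a, float)
    return np.where(np.abs(a) <= h, 0.75 / h * (1.0 - (a / h) ** 2), 0.0)

def pair_correlation(points, radii, h, area):
    """Estimate g(r) for 2-D points observed in a window of the given area (b_ij = 1)."""
    points = np.asarray(points, float)
    n = len(points)
    rho = n / area                                   # mean intensity estimate
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[~np.eye(n, dtype=bool)]                 # all ordered pairs with i != j
    g = np.empty(len(radii))
    for m, r in enumerate(radii):
        g[m] = epanechnikov(r - d, h).sum() / (2 * np.pi * r * rho ** 2 * area)
    return g

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(size=(400, 2))                 # roughly homogeneous pattern
    print(pair_correlation(pts, [0.05, 0.1], h=0.02, area=1.0))
```

For a homogeneous pattern the estimate should stay close to 1, slightly lower near the window boundary because no edge weighting is applied; clustered patterns give values above 1 at short radii.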

The following figure illustrates Cox correlation function in 1D space. (a)-(c) Kernel weights for corresponding radius. (d)  Cox process correlation function over different radii.

  • Autocorrelation Cox process

Each video \(n \in N\) is represented by \(K\) visual word point processes. A point \({\bf x}\) consists of \((l,v,h,t)\): the visual word label \(l\), vertical location \(v\), horizontal location \(h\), and the frame number \(t\). For each point process \(X_k\), the correlation function \(\hat{g}_k\) of the univariate Cox process is estimated as described under Univariate Cox process above; \(\hat{g}_k\) represents the spatio-temporal distribution of \(X_k\). Our goal is to learn a common structure of each visual word within a video class.

Correlation estimates of the univariate Cox process for all visual words are taken as the input feature of each video.  In order to infer the autocorrelation structure relevant for classification, we adopt the information gain feature selection principle.  From an initial pool of features D with C classes, information gain of each feature \({\bf d_f}\) is calculated.

\[ \begin{aligned} IG(D, {\bf d_f}) &= H(D) - \sum_{i = 1 }^{V} \frac{|{\bf d}_f=i|}{|D|} H({{\bf d}_f=i}), \\ H(D) &= -\sum_{j=1}^{C}p_j(D) \log p_j(D) \end{aligned} \]

Features with information gain higher than the threshold are selected as input features to an SVM classifier. Algorithm 1 summarizes the AutoCox modeling approach for video classification.
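The information gain criterion can be sketched in a few lines of standard-library Python. The example assumes that each feature has already been discretized into \(V\) values and uses base-2 logarithms; both are assumptions, since the binning scheme and the logarithm base are not specified here.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_j p_j log2 p_j over the class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(D) - sum_i |d_f = i| / |D| * H(d_f = i) for one discretized feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def select_features(columns, labels, threshold):
    """Indices of feature columns whose information gain exceeds the threshold."""
    return [i for i, col in enumerate(columns)
            if information_gain(col, labels) > threshold]

if __name__ == "__main__":
    labels = ['normal', 'normal', 'abnormal', 'abnormal']
    columns = [[0, 0, 1, 1],    # perfectly informative feature
               [0, 1, 0, 1]]    # uninformative feature
    print(select_features(columns, labels, threshold=0.21))
```

Raising the threshold keeps fewer features, which is the trade-off examined in the experiments.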

The following figure shows examples of the AutoCox using isotropic and anisotropic kernels. (a),(b) Point process from top-view. (c),(d) correlation for each kernel.

3. Experiments

For all datasets, local motion and visual appearance features are extracted using the Dense Trajectories Video Descriptor [2]. Once the descriptors are extracted, K-means clustering is applied to each channel to construct the visual word sets and assign each descriptor to its closest center. Each descriptor in a video has four independent labels, one for each of the four feature channels. In the next phase, the autocorrelation of each visual word is estimated using the Cox process. The correlation function can be estimated for an isotropic kernel in 3D or an anisotropic kernel with independent space and time profiles. Radii settings and the width of the kernel are decided empirically. Finally, the discriminative autocorrelation elements are selected using the information gain feature selection, and an SVM model is learned for classification.

  • Abnormal group activity detection

The UMN dataset [3] consists of group activities in two different states, normal and abnormal. The videos are recorded in three different scenes, one indoor and two outdoor, with a total of eleven video sequences that start in a normal state of natural walking and end in an abnormal “disperse” motion of the group of people.

The information gain threshold is set to 0.21 during the training phase. Correlation features with a gain higher than the threshold are selected. The information gain of autocorrelation values from the four different feature channels is shown in the following figure.

The impact of the information gain feature selection is shown in the following figure. The blue dashed line is the AUC score of the classifier that uses all 3000 features of the AutoCox in each channel. The red solid line shows the AUC scores as a function of the information gain threshold. A higher information gain threshold results in fewer selected features. Accuracy drops once the gain threshold exceeds 0.6, because too few features are selected. The figure shows that selecting a subset of features actually improves AUC over using all features. This supports our claim that not all correlation structure is useful for classification.

The abnormality detection results are reported in Table 1. The AutoCox improves on the result of the BoW model and achieves the best area under the ROC curve (AUC) among state-of-the-art methods. Our proposed model achieves the same AUC as Wu et al. [4]. However, our model accomplishes this with fewer frames in the training set: [4] used 75% of normal clips from six sequences, whereas we used only 8.26%, including normal and abnormal clips, from all eleven sequences.

The AUC comparison among the AutoCox, the BoW, and the BoWSPM models as a function of the training set size is reported in the following. The AutoCox achieves higher AUC than the BoW or BoWSPM models across different sizes of the training set. This result reveals that the structure of the point process distribution of each visual word conveys more discriminative information than the histogram of occurrence frequency that the BoW model uses. In fact, the BoWSPM model also incorporates spatio-temporal context at a coarse level, and it outperforms the BoW model. However, the partition grid of BoWSPM suggested by [5] is too coarse to capture the precise spatio-temporal context in videos.

The following video illustrates examples of the point process and the autocorrelation on two videos.

  • Human action classification

The experiment is performed on the YouTube dataset [6]. The dataset consists of 1168 videos from 11 actions: basketball, biking, diving, golf swing, horse riding, soccer juggling, swing, tennis swing, trampoline jumping, volleyball spiking and walking. Following [6], a 25-fold cross validation was used for classification. In Table 2, the AutoCox model with \(M=10\) achieved the best result. ST-Correlaton degrades the classification accuracy of BoW when the histogram of ST-Correlaton is combined with that of BoW. The AutoCox model with \(M=2000\) visual words also deteriorates the accuracy of BoW. This result suggests that meaningful correlation structure can be extracted using a number of visual words significantly smaller than that of traditional BoW histograms.

4. Conclusion
We present a novel method that enables learning the spatio-temporal context in videos without suffering a quadratic increase in the number of features. The proposed AutoCox model is used to generate contextual autocorrelation spatio-temporal features, one per visual word, to describe longer range co-occurrence patterns in space and time. Information gain is then applied to extract meaningful features that are subsequently used to classify visual events and activities. Our proposed model outperforms the BoW and achieves state-of-the-art performance for anomaly crowd activity detection and traffic density classification problems.


  • [1] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox Processes. Scandinavian Journal of Statistics, 1998.
  • [2] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. CVPR, 2011.
  • [3] Unusual crowd activity dataset:
  • [4] S. Wu, B. E. Moore, and M. Shah. Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes. CVPR, 2010.
  • [5] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. CVPR, 2008.
  • [6] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos “in the wild”. CVPR, 2009.

Full text: [pdf]

Presentation slide: [pptx]

Pose Invariant Activity Classification for Multi-floor Indoor Localization

1. Abstract

Smartphone-based indoor localization has attracted considerable interest from the localization community in recent years.  Combining pedestrian dead reckoning obtained using the phone’s inertial sensors with the GraphSLAM (Simultaneous Localization and Mapping) algorithm is one of the most effective approaches for reconstructing the entire pedestrian trajectory given a set of landmarks visited during movement.  A key to GraphSLAM-based localization is the detection of reliable landmarks, which are typically identified using visual cues or via NFC tags or QR codes.  Alternatively, human activity can be classified to detect organic landmarks, such as visits to stairs and elevators, while in movement.  We provide a novel human activity classification framework that is invariant to the pose of the smartphone.  Pose invariant features allow robust observation no matter how a user puts the phone in the pocket.  In addition, activity classification obtained by an SVM (Support Vector Machine) is used in a Bayesian framework with an HMM (Hidden Markov Model) that improves the activity inference based on temporal smoothness.  Furthermore, the HMM jointly infers activity and floor information, thus providing multi-floor indoor localization.  Our experiments show that the proposed framework detects landmarks accurately and enables multi-floor indoor localization from the pocket using GraphSLAM.

2. Motivation

  • We extended the design of pose invariant features for an activity classification task.  Pose is defined by how a person puts a smartphone in the pocket.  We show that pose invariant features can be used to successfully classify activities.
  • We designed a Hidden Markov Model that enables the integration of activity classification and floor inference.
  • We applied the GraphSLAM algorithm with our activity and floor detection framework to provide multi-floor localization in a building.

3. Framework

The overview of the framework is depicted in the following figure.

a) Feature Extraction

  • Pose Invariant Feature for IMU Sensors

IMU sensor readings depend on the pose of the smartphone, which is defined as the orientation of the phone in the pocket.  A pose-invariant system is highly desirable because it frees the user from the restriction of keeping the smartphone in a particular orientation.  Kobayashi et al. [1] identified that the autocorrelation of acceleration data is invariant to the rotation of the accelerometer:

\[\begin{aligned}
f(\omega) &= \int \exp(-i \omega t)s(t)\,dt \in \mathbb{C}^3, \\
F &= [f(\omega_1), f(\omega_2), \dots, f(\omega_n)] \in \mathbb{C}^{3 \times n}, \\
A &= F^*F.
\end{aligned}\]

The pose invariance follows from the fact that the rotation matrix \(R\) is orthogonal, \(R^TR=I\):
\[\begin{aligned}
\hat{s}(t) &= Rs(t), \\
\hat{f}(\omega) &= \int \exp(-i \omega t)\hat{s}(t)\,dt \\
&= R\int \exp(-i \omega t)s(t)\,dt \\
&= Rf(\omega), \\
\hat{A} &= \hat{F}^*\hat{F} = F^*R^TRF = F^*F = A.
\end{aligned}\]
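
The invariance can be checked numerically. The following is a minimal sketch, assuming a discrete Fourier transform over a finite window in place of the integral, a synthetic 3-axis signal, and an arbitrary rotation; the helper names (`dft_features`, `gram`, `rotation_zx`) are illustrative and not taken from [1].

```python
import cmath
import math

def dft_features(signal, n_freqs):
    """Return F as n_freqs columns; column k is the 3-vector f(omega_k)."""
    T = len(signal)
    cols = []
    for k in range(n_freqs):
        omega = 2.0 * math.pi * k / T
        cols.append([sum(cmath.exp(-1j * omega * t) * signal[t][axis]
                         for t in range(T)) for axis in range(3)])
    return cols

def gram(cols):
    """A = F^* F, i.e. A[j][k] = conj(f(omega_j)) . f(omega_k)."""
    n = len(cols)
    return [[sum(cols[j][a].conjugate() * cols[k][a] for a in range(3))
             for k in range(n)] for j in range(n)]

def rotation_zx(yaw, roll):
    """Rotation matrix R = Rz(yaw) * Rx(roll); any orthogonal R would do."""
    cz, sz = math.cos(yaw), math.sin(yaw)
    cx, sx = math.cos(roll), math.sin(roll)
    Rz = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]
    Rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    return [[sum(Rz[i][k] * Rx[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotate(signal, R):
    """Apply s_hat(t) = R s(t) to every sample."""
    return [[sum(R[i][j] * s[j] for j in range(3)) for i in range(3)]
            for s in signal]

# Synthetic 3-axis signal standing in for accelerometer samples (assumption).
signal = [[math.sin(0.3 * t), math.cos(0.7 * t), (0.1 * t) % 1.0]
          for t in range(64)]
R = rotation_zx(yaw=0.9, roll=-1.4)

A = gram(dft_features(signal, 8))
A_hat = gram(dft_features(rotate(signal, R), 8))

# Largest entry-wise difference between A and A_hat, and the scale of A.
max_diff = max(abs(A[j][k] - A_hat[j][k]) for j in range(8) for k in range(8))
scale = max(abs(A[j][k]) for j in range(8) for k in range(8))
```

Here `max_diff` should be at the level of floating-point round-off relative to `scale`, which is what \(\hat{A} = A\) predicts.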

We extend the idea and apply the pose invariant features on both accelerometer and gyroscope sensor data to classify pedestrian activity.  The following figure illustrates pose invariant features.


  • Statistical Features from a Barometer

Barometer readings fluctuate continually even when the sensor stays at the same level, so we use statistical features to obtain robust observations, as listed in the following table.

b) SVM Activity Classification
Rotation-invariant features are extracted from the accelerometer and gyroscope, and statistical features from the barometer, in each sliding window of an input sequence.  A linear SVM classifies each sliding-window sample and generates class probabilities using Platt’s scaling algorithm.  The SVM classification is limited to observations from a single sliding window and retains no reference to activities in previous windows.  Hence, sporadic misclassifications may arise.  Classification results can be improved by promoting temporal smoothness in the activity sequence.

c) HMM Activity and Floor Inference
Activity classification results obtained from the SVM can be refined by an HMM if we define activities as states and suppress the unlikely state transitions.  Furthermore, by extending the definition of a state as a joint identification of the activity and the floor, state inference can integrate activity with floor inferences.  Such a combined state will help constrain the state transition.
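
To illustrate how suppressing unlikely transitions removes a sporadic misclassification, here is a minimal Viterbi sketch over two activity states. The state set, the 0.95 stay probability, and the per-window posteriors are made-up values for illustration, and the posteriors are used directly as observation scores, which is equivalent up to a constant under uniform priors.

```python
import math

def viterbi(log_prior, log_trans, log_obs):
    """Most likely state sequence.

    log_prior[s]    : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_obs[t][s]   : log observation score of state s at window t
    """
    n_states = len(log_prior)
    score = [log_prior[s] + log_obs[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at window t.
            best_r = max(range(n_states),
                         key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_r)
            new_score.append(score[best_r] + log_trans[best_r][s] + log_obs[t][s])
        score = new_score
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# State 0 = walk, state 1 = stair up (illustrative two-state example).
posteriors = [[0.8, 0.2]] * 3 + [[0.4, 0.6]] + [[0.8, 0.2]] * 3
trans = [[0.95, 0.05], [0.05, 0.95]]  # assumed "sticky" transitions

log_obs = [[math.log(p) for p in row] for row in posteriors]
log_trans = [[math.log(p) for p in row] for row in trans]
log_prior = [math.log(0.5), math.log(0.5)]

raw = [max(range(2), key=lambda s: row[s]) for row in posteriors]
smoothed = viterbi(log_prior, log_trans, log_obs)
```

The per-window decision `raw` flips to state 1 for a single window, whereas `smoothed` stays in state 0 throughout, because two state changes cost more than the weak evidence in that one window is worth.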

  • Transition probability

We manually design the transition probabilities as shown in the following figure.  The design reflects the fact that activity transitions occur sparsely over time, so the probability of a state transition is much lower than that of staying in the same state.  Moreover, transitions between certain activities are not possible.
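
A hand-designed transition table of this kind can be sketched as follows. The stay probability of 0.95 and the set of allowed transitions (stairs reachable only from walking, elevators only from standing still) are assumptions chosen for illustration; the actual values are those in the figure.

```python
ACTIVITIES = ["walk", "stair_down", "stair_up",
              "stand_still", "elevator_down", "elevator_up"]

# Hypothetical allowed direct transitions; every other pair gets probability 0.
ALLOWED = {
    "walk": ["stair_down", "stair_up", "stand_still"],
    "stair_down": ["walk"],
    "stair_up": ["walk"],
    "stand_still": ["walk", "elevator_down", "elevator_up"],
    "elevator_down": ["stand_still"],
    "elevator_up": ["stand_still"],
}

def build_transition(p_stay=0.95):
    """Return trans[a][b]: high self-transition, remainder split over allowed moves."""
    trans = {}
    for a in ACTIVITIES:
        row = {b: 0.0 for b in ACTIVITIES}
        row[a] = p_stay
        share = (1.0 - p_stay) / len(ALLOWED[a])
        for b in ALLOWED[a]:
            row[b] = share
        trans[a] = row
    return trans

trans = build_transition()
row_sums = {a: sum(trans[a].values()) for a in ACTIVITIES}
```

Each row sums to one, staying in the current activity dominates, and forbidden pairs such as `trans["stair_up"]["elevator_up"]` are exactly zero, so the decoder can never pass through them.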

  • Observation probability

Observation probabilities are obtained jointly from activity and floor likelihood.  Air pressure observation from a barometer \(y_{\textrm{floor}}\) is modeled by a mixture of Gaussians, where each floor forms a Gaussian distribution with \((\mu_{\textrm{floor}}, \sigma_{\textrm{floor}})\).  Activity class posterior \(p(s_{\textrm{act}_i}|y_\textrm{act})\) is estimated from Platt’s scaling on SVM decision values.

\[\begin{aligned}
p(y|s_i) &= \frac{p(s_i | y)\,p(y)}{p(s_i)},\\
p(s_i|y) &= p(s_{\textrm{floor}_i}|y_{\textrm{floor}})\, p(s_{\textrm{act}_i}|y_{\textrm{act}}),\\
p(y) &= \frac{1}{|T|}, \quad p(s_i) = \frac{1}{|S|}.
\end{aligned}\]
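
The joint observation model can be sketched in a few lines. This is an illustration under assumptions: the floor Gaussians have equal mixture weights, the pressure values (in hPa) and activity posteriors are invented, and \(|T|\) is read as the number of observations in the sequence.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def floor_posterior(pressure, floor_params):
    """p(s_floor | y_floor), assuming equal mixture weights for all floors."""
    lik = {f: gaussian_pdf(pressure, mu, sigma)
           for f, (mu, sigma) in floor_params.items()}
    total = sum(lik.values())
    return {f: v / total for f, v in lik.items()}

def joint_posterior(pressure, act_posterior, floor_params):
    """p(s_i | y) = p(s_floor_i | y_floor) * p(s_act_i | y_act) per joint state."""
    p_floor = floor_posterior(pressure, floor_params)
    return {(a, f): pa * pf
            for a, pa in act_posterior.items()
            for f, pf in p_floor.items()}

def observation_likelihood(posterior, n_obs):
    """p(y | s_i) = p(s_i | y) p(y) / p(s_i) with p(y) = 1/|T|, p(s_i) = 1/|S|."""
    n_states = len(posterior)
    return {s: p * n_states / n_obs for s, p in posterior.items()}

# Hypothetical per-floor (mean, std) of air pressure in hPa.
floor_params = {0: (1013.0, 0.1), 1: (1012.6, 0.1)}
# Hypothetical Platt-scaled SVM output for one window.
act_posterior = {"walk": 0.7, "stair_up": 0.3}

post = joint_posterior(1012.95, act_posterior, floor_params)
lik = observation_likelihood(post, n_obs=100)
best_state = max(post, key=post.get)
```

Because both priors are uniform, `lik` is the posterior scaled by a constant, so the ranking of states, and therefore the Viterbi path, is unchanged by that factor.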

d) Post-Process Rectification
The HMM smooths the state transition because the probability of state change is much smaller than that of staying in the same state.  Thus, the number of sporadic misclassifications from the SVM may be reduced.  In addition, activity inference of the HMM can be further improved by rectifying activities of stairs that involve no floor change to walk and, likewise, elevators to stand still.
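
The rectification rule can be sketched as a pass over the decoded state sequence. How a floor change is detected is an assumption here: the floor of the window just before a stairs or elevator segment is compared with the floor of the window just after it.

```python
# Activities that must involve a floor change, and what they fall back to.
FALLBACK = {
    "stair_down": "walk",
    "stair_up": "walk",
    "elevator_down": "stand_still",
    "elevator_up": "stand_still",
}

def rectify(states):
    """states: list of (activity, floor) per window, e.g. from Viterbi decoding.

    Stairs/elevator segments with the same floor before and after are
    relabeled to walk/stand still; floors are left untouched.
    """
    out = list(states)
    i = 0
    while i < len(states):
        act = states[i][0]
        # Extend the segment while the activity label stays the same.
        j = i
        while j + 1 < len(states) and states[j + 1][0] == act:
            j += 1
        if act in FALLBACK:
            floor_before = states[i - 1][1] if i > 0 else states[i][1]
            floor_after = states[j + 1][1] if j + 1 < len(states) else states[j][1]
            if floor_before == floor_after:
                for k in range(i, j + 1):
                    out[k] = (FALLBACK[act], states[k][1])
        i = j + 1
    return out

decoded = [("walk", 1), ("stair_up", 1), ("stair_up", 1), ("walk", 1),
           ("stair_up", 1), ("stair_up", 2), ("walk", 2)]
rectified = rectify(decoded)
```

In this example the first `stair_up` segment starts and ends on floor 1 and is relabeled to `walk`, while the second one, which moves from floor 1 to floor 2, is kept.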

e) Multi-floor GraphSLAM with Organic Landmark
GraphSLAM is an approach that optimizes a trajectory by representing it as a graph of constraints between consecutive positions and by minimizing an error to satisfy the constraints specified by the graph. In order to obtain an accurate trajectory, GraphSLAM requires a good number of landmarks visited more than once.  Detailed explanation and formulation can be found in the tutorial [2].
In this paper, we focus on providing organic landmarks, namely the stairs and elevators detected as a pedestrian moves inside a building.  The identity of a landmark can be determined by comparing WiFi visibility signatures, such as the MAC addresses of visible WiFi access points.  During training, the WiFi visibility and physical location of every landmark are recorded in a reference landmark list.  Then, when a landmark is detected during testing, we compare the current WiFi visibility to that of every landmark in the reference list and take the physical location of the closest one.
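
The landmark lookup can be sketched as a nearest-neighbor search over WiFi signatures. The distance measure is not specified above; this sketch assumes the Jaccard distance between sets of visible MAC addresses, and the MAC addresses and coordinates are invented.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of MAC addresses (assumed metric)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def match_landmark(visible_macs, reference):
    """Return the location of the reference landmark with the closest signature.

    reference: list of dicts with 'macs' (set of str) and 'location' (x, y, floor).
    """
    best = min(reference,
               key=lambda lm: jaccard_distance(visible_macs, lm["macs"]))
    return best["location"]

# Hypothetical reference list collected during training.
reference = [
    {"macs": {"aa:01", "aa:02", "aa:03"}, "location": (12.0, 4.5, 1)},
    {"macs": {"bb:01", "bb:02", "aa:03"}, "location": (40.0, 4.5, 1)},
    {"macs": {"cc:01", "cc:02"}, "location": (12.0, 4.5, 2)},
]

# Access points visible when a stairs/elevator landmark is detected at test time.
location = match_landmark({"bb:01", "bb:02", "dd:09"}, reference)
```

The matched location is then attached to the detected landmark as a constraint for GraphSLAM.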

3. Experiments and Results
We evaluated the proposed method in a large, multi-floor office building with many stairs and elevators.  Accelerometer, gyroscope, and barometer data were recorded from Android smartphones at 50 Hz.  The training data total 10271 seconds, recorded by three subjects.  To simplify annotation, the same activity was performed repeatedly.  We defined six indoor activities: walking, taking stairs down, taking stairs up, standing still, taking the elevator down, and taking the elevator up.  For the test data, subjects walked naturally inside the building.  The test set consists of 12 sequences totaling 6160 seconds.

a) Quantitative analysis
Activity classification results for various models are shown in the following table.  Columns show class accuracies for the SVM, HMM, and rectification results, respectively.  HMM inference with the Viterbi algorithm improves on the SVM classification for all activities.  Figure 4 shows that the HMM reduces confusion among the locomotive activities of walk, stair down, and stair up.  It also reduces misclassification of stand still as elevator down or elevator up.  These sporadic misclassifications are suppressed by the temporal smoothing of the HMM.  Finally, post-processing the HMM inference rectifies walk segments that were misclassified as stairs.

b) Qualitative analysis
The following figure shows an example of the inference result.  The activity labels are walking (WA), stair down (SD), stair up (SU), stand still (SS), elevator down (ED), and elevator up (EU), from bottom to top.  In the given sequence, the user visited 7 floors, including \(F0\), which is the basement.

We observe that the SVM confuses the locomotive activities of walking, stair down, and stair up.  These misclassifications are corrected by the HMM Viterbi algorithm, which also infers the floor correctly.  Post-processing further relabels as walk the stairs-down and stairs-up activities that did not incur a floor change.

c) Multi-floor GraphSLAM
This example describes a trajectory that included visits to 4 floors.

The initial trajectory obtained from smartphone PDR contains drift.

The following video shows how GraphSLAM improves trajectory using organic landmarks obtained from the proposed framework.

4. Conclusion
In this paper, we propose a novel framework that jointly infers activity and floor landmarks.  Pose invariant features from inertial sensors are adopted for SVM-based activity classification.  We design an HMM where an activity on each floor defines a state.  State transitions are designed to provide temporal smoothness of a state sequence.  The probability of an observation is estimated from an activity class probability provided by the SVM classification, which is multiplied by the floor likelihood from a mixture-of-Gaussians model.  We further rectify activity inferences when we observe stairs and elevator activities that do not incur floor changes.  Our experiments show that the proposed framework accurately classifies activities and infers floors.  Finally, we show that the organic landmarks obtained from our framework can be applied effectively to enable multi-floor GraphSLAM.


  • [1] T. Kobayashi, K. Hasida, and N. Otsu, “Rotation invariant feature extraction from 3D acceleration signals,” in ICASSP, 2011.
  • [2] G. Grisetti, R. Kümmerle, C. Stachniss, and W. Burgard, “A tutorial on graph-based SLAM,” in Intelligent Transportation Systems Magazine, IEEE, 2010.

Full text: [pdf]

Presentation slide: [pptx]