Spatio-Temporal Context Modeling for BoW-Based Video Classification

1. Abstract
We propose an autocorrelation Cox process that extends the traditional bag-of-words representation to model the spatio-temporal context within a video sequence. Bag-of-words models are effective tools for representing a video by a histogram of visual words that describe local appearance and motion. A major limitation of this model is its inability to encode the spatio-temporal structure of visual words pertaining to the context of the video. Several works have proposed to remedy this by learning the pairwise correlations between words. However, pairwise analysis leads to a quadratic increase in the number of features, making the models prone to overfitting and challenging to learn from data. The proposed autocorrelation Cox process model encodes, in a compact way, the contextual information within a video sequence, leading to improved classification performance. Spatio-temporal autocorrelations of visual words estimated from the Cox process are coupled with the information gain feature selection to discern the essential structure for the classification task. Experiments on crowd activity and traffic density dataset illustrate that the proposed model achieves state-of-the-art performance while providing intuitive spatio-temporal descriptors of the video context.

2. Spatio-temporal context model

  • Univariate Cox process

Cox process \(X\) is a point process defined on a locally finite subset \(S \subset \mathbb{R}^2\). The intensity \(\Lambda\) of Cox process \(X\) follows from stochastic process.  If intensity \(\Lambda\) is spatially constant, the Cox process follows homogeneous Poison process.  [1] proposed a Log Gaussian Cox Process(LGCP) to model the spatial point process.  The intensity process \(\Lambda\) of LGCP follows the log Gaussian process:

\Lambda &= \{\Lambda(s) : s \in \mathbb{R}^2\}, \\ \Lambda(s) &= \exp\{ Y(s)\}, \\ Y &\sim \mathcal{N}(\mu,\sigma^2)

Summary statistics of the Log-Gaussian Cox process X with intensity \(\Lambda\) are defined by the first and second order moments.  [1] suggest efficient non-parametric estimation of the mean intensity \(\rho\) and the correlation function \(c(r)\) for a univariate Cox process \(X\):

\hat{\rho} &= n/A(S),\\
\hat{g}(r)&=\frac{1}{2\pi r \hat{\rho}^2 A(S)}\sum_i \sum_{j \ne i}k_h(r – ||x_i – x_j ||_2)b_{ij},\\
k_h(a) &= \frac{3}{4h}(1-a^2/h^2) {\delta (a)},\nonumber \\
\delta(x) &=
1 & \text{if } -h \le x\le h\\
0 & \text{otherwise}

The following figure illustrates Cox correlation function in 1D space. (a)-(c) Kernel weights for corresponding radius. (d)  Cox process correlation function over different radii.

  • Autocorrelation Cox process

Each video \(n \in N\) is represented by \(K\) visual word point processes.  Point \({\bf x}\) consists of \((l,v,h,t)\) : the visual word label \(l\), vertical location \(v\), horizontal location \(h\), and the frame number \(t\).  For each point process \(X_k\), the univariate Cox process \(\hat{g}_k\) is estimated as described in Univariate Cox process. \(\hat{g}_k\) represents the spatio-temporal distribution of \(X_k\).  Our goal is to learn a common structure of each visual word within a video class.  

Correlation estimates of the univariate Cox process for all visual words are taken as the input feature of each video.  In order to infer the autocorrelation structure relevant for classification, we adopt the information gain feature selection principle.  From an initial pool of features D with C classes, information gain of each feature \({\bf d_f}\) is calculated.

IG(D, {\bf d_f}) &= H(D) – \sum_{i = 1 }^{V} \frac{|{\bf d}_f=i|}{|D|} H({{\bf d}_f=i}), \\
H(D) &= -\sum_{j=1}^{C}p_j(D) \log p_j(D)

Features with information gain higher than the threshold are selected as input features to an SVM classifier. Algorithm 1 summarizes the AutoCox modeling approach for video classification.

The following figure shows examples of the AutoCox using isotropic and anisotropic kernels. (a),(b) Point process from top-view. (c),(d) correlation for each kernel.

3. Experiments

For all datasets, local motion and visual appearance feature are extracted using the Dense Trajectories Video Descriptor [2].  Once the descriptor is extracted, K-means clustering is applied on each channel to construct the visual words sets and assign all descriptor to its closest center.  Each descriptor in a video has four independent labels each corresponding to four feature channels.   In the next phase, autocorrelation of each visual word is estimated using the Cox process.  The correlation function can be estimated for an isotropic kernel in 3D or an anisotropic kernel with independent space and time profiles.  Radii settings and the width of the kernel are decided empirically.  Finally, the discriminative autocorrelation element is selected using the information gain feature selection and an SVM model is learned for classification.

  • Abnormal group activitty detection

The UMN dataset [3] consist of group activities in two different states, normal and abnormal.  The video is recorded in three different scenes including one indoor and two outdoor scenes, with a total of eleven video sequences that start as normal state of natural walk and end in abnormal “disperse” motion of the group of people.

Information gain threshold is set as 0.21 during the training phase.  Correlation features with the gain higher than the threshold are selected.  Information gain of autocorrelation values from four different feature channel is shown in the following figure.

Impact of the information gain feature selection is shown in the following figure.  The blue dashed line is the AUC score of the classifier that uses all 3000 features of the AutoCox in each channel.  The red solid line shows the AUC scores as a function of the information gain threshold.  Selecting the higher information gain threshold results in selecting the fewer features.  It shows that accuracy dropped after choosing the gain threshold over 0.6 because too few features were selected.  The figure shows that selecting subset of features actually improves AUC over using all features.  This justifies our claim that not all correlation structure is useful for classification.

The abnormality detection results are reported in Table 1.  The AutoCox improves the result of the BoW model and achieves the best area under ROC curve(AUC) among state-of-the-art.  Our proposed model achieved the same AUC as Wu et al.[4].   However, our model was able to accomplish this in the training set using fewer frames.  [4] used 75% of normal clips from six sequences, whereas we used only 8.26% including normal and abnormal clips from all eleven sequences.

AUC comparison among the AutoCox, the BoW, and the BoWSPM models as a function of the training set size is reported in the following.  The AutoCox achieves higher AUC than the BoW or BoWSPM models across different sizes of the training set.  This result reveals that the structure of the point process distribution of each visual word conveys higher discriminative information compared to the histogram of occurrence frequency that the BoW model uses.  In fact, BoWSPM model also incorporates spatio-temporal context in coarse level, which outperforms the BoW model.  However, the partition grid of BoWSPM suggested by [5] too coarse to capture the precise spatio-temporal context in videos.

The following video illustrates examples of the point process and the autocorrelation on two videos.

  • Human action classification

The experiment is performed on the YouTube dataset [6].  The dataset consists of 1168 videos from 11 actions of basketball, biking, diving, golf swing, horse riding, soccer juggling, swing, tennis swing, trampoline jumping, volleyball spiking and walking.  Following [6], a 25 fold cross validation was used for classification.  In Table 2, the AutoCox model with \(M=10\) achieved the best result.  ST-Correlaton degrades the classification accuracy of BoW when it combines histogram of ST-Correlaton with that of BoW.  The autoCox model with \(M=2000\) visual words also deteriorates the accuracy of BoW.  This result suggests that the meaningful correlation structure can be extracted from using a number of visual words significantly smaller from that of traditional, BoW histograms.

4. Conclusion
We present a novel method that enables learning the spatio-temporal context in videos without suffering quadratic increase in the number of features.  The proposed AutoCox model is used to generate contextual autocorrelation spatio-temporal features, one per each visual word, to describe longer range co-occurrence patterns in space and time.  Information gain is then applied to extract meaningful features that are subsequently used to classify visual events and activities.  Our proposed model outperforms the BoW and achieved state-of-the-art performance for anomaly crowd activity detection and traffic density classification problem. 


  • [1] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox Processes. Scandinavian Journal of Statistics, 1998.
  • [2] H. Wang, A. Kl¨aser, C. Schmid, and L. Cheng-Lin. Action Recognition by Dense Trajectories. CVPR, 2011.
  • [3] Unusual crowd activity dataset:
  • [4] S. Wu, B. E. Moore, and M. Shah. Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes. CVPR, 2010.
  • [5] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. CVPR, 2008.
  • [6] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos ”in the wild”. CVPR, 2009.

Full text: [pdf]

Presentation slide: [pptx]

Pose Invariant Activity Classification for Multi-floor Indoor Localization

1. Abstract

Smartphone based indoor localization caught massive interest of the localization community in recent years.  Combining pedestrian dead reckoning obtained using the phone’s inertial sensors with the GraphSLAM (Simultaneous Localization and Mapping) algorithm is one of the most effective approaches to reconstruct the entire pedestrian trajectory given a set of visited landmarks during movement.  A key to GraphSLAM-based localization is the detection of reliable landmarks, which are typically identified using visual cues or via NFC tags or QR codes.  Alternatively, human activity can be classified to detect organic landmarks such as visits to stairs and elevators while in movement.  We provide a novel human activity classification framework that is invariant to the pose of the smartphone.  Pose invariant features allow robust observation no matter how a user puts the phone in the pocket.  In addition, activity classification obtained by an SVM (Support Vector Machine) is used in a Bayesian framework with an HMM (Hidden Markov Model) that improves the activity inference based on temporal smoothness.  Furthermore, the HMM jointly infers activity and floor information, thus providing multi-floor indoor localization.  Our experiments show that the proposed framework detects landmarks accurately and enables multi-floor indoor localization from the pocket using GraphSLAM.

2. Motivation

  • We extended the design of pose invariant features for an activity classification task.  Pose is defined by how a person puts a smartphone in the pocket.  We show that pose invariant features can be used to successfully classify activities.
  • We designed a Hidden Markov Model that enables the integration of activity classification and floor inference.
  • We applied the GraphSLAM algorithm with our activity and floor detection framework to provide multi-floor localization in a building.

3. Framework

The overview of the framework is depicted in the following figure.

a) Feature Extraction

  • Pose Invariant Feature for IMU Sensors

IMU sensor readings depend on the pose of the smartphone, which is defined as the orientation of the phone in the pocket.  A pose-invariant system is strongly desirable because it frees the user from the restriction of keeping the smartphone in a particular orientation. [1] identified that the autocorrelation of acceleration data is invariant to the rotation of the accelerometer: 

f(\omega) &= \int exp(-i \omega t)s(t)dt \in \mathbb{C}^3, \\
F &= [f(\omega_1), f(\omega_2), \dots, f(\omega_n)] \in \mathbb{C}^{\mathrm{3}\times\mathrm{n}},\\
A &= F^*F.

The pose invariant property is inherited from the fact that the rotation matrix \(R\) is an orthogonal matrix, \(R^TR=I\).
\hat{s}(t) &= Rs(t), \nonumber \\
\hat{f}(\omega) &= \int exp(-i \omega t)\hat{s}(t)dt, \nonumber\\
&= R\int exp(-i \omega t)s(t)dt, \nonumber\\
&= Rf(\omega). \\
\hat{A} &= \hat{F}^*\hat{F} = F^*R^TRF = F^*F = A.

We extend the idea and apply the pose invariant features on both accelerometer and gyroscope sensor data to classify pedestrian activity.  The following figure illustrates pose invariant features.


  • Statistical Features from a Barometer

Barometer readings consistently fluctuate even if the sensor stays at the same level, thus we need to use some statistical features to get robust observations, as listed in the following table.

b) SVM Activity Classification
Rotation invariant features are extracted from accelerometer and gyroscope sensors and statistical features from barometer in each sliding window of an input sequence.  A linear SVM model classifies each sliding window sample and generates class probability from Platt’s scaling algorithm. SVM classification is limited to observations from one sliding window and has no ability to maintain reference to activities occurring in previous sliding windows.  Hence, there may arise sporadic misclassifications.  Classification results can be improved if we promote temporal smoothness on the activity sequence.

c) HMM Activity and Floor Inference
Activity classification results obtained from the SVM can be refined by an HMM if we define activities as states and suppress the unlikely state transitions.  Furthermore, by extending the definition of a state as a joint identification of the activity and the floor, state inference can integrate activity with floor inferences.  Such a combined state will help constrain the state transition.

  • Transition probability

We manually design the transition probabilities as shown in the following figure.  It results from the fact that activity transition occurs sparsely over time, thus probability of state transition is much lower than staying in the same state.  Moreover, the transition between certain activities is not possible.

  • Observation probability

Observation probabilities are obtained jointly from activity and floor likelihood.  Air pressure observation from a barometer \(y_{\textrm{floor}}\) is modeled by a mixture of Gaussians, where each floor forms a Gaussian distribution with \((\mu_{\textrm{floor}}, \sigma_{\textrm{floor}})\).  Activity class posterior \(p(s_{\textrm{act}_i}|y_\textrm{act})\) is estimated from Platt’s scaling on SVM decision values.

p(y|s_i) &= \frac{p(s_i | y)p(y)}{p(s_i)},\\
p(s_i|y) &= p(s_{\textrm{floor}_i}|y_{\textrm{floor}}) p(s_{\textrm{act}_i}|y_{\textrm{act}}),\\
p(y) &= \frac{1}{|T|}, p(s_i) = \frac{1}{|S|}.

d) Post-Process Rectification
The HMM smooths the state transition because the probability of state change is much smaller than that of staying in the same state.  Thus, the number of sporadic misclassifications from the SVM may be reduced.  In addition, activity inference of the HMM can be further improved by rectifying activities of stairs that involve no floor change to walk and, likewise, elevators to stand still.

e) Multi-floor GraphSLAM with Organic Landmark
GraphSLAM is an approach that optimizes a trajectory by representing it as a graph of constraints between consecutive positions and by minimizing an error to satisfy the constraints specified by the graph. In order to obtain an accurate trajectory, GraphSLAM requires a good number of landmarks visited more than once.  Detailed explanation and formulation can be found in the tutorial [2].
In this paper, we focus on providing organic landmarks which are stairs and elevators detected when a pedestrian moves inside a building.  The identity of landmarks can be determined by comparing WiFi visibility signatures such as the MAC address of a WiFi access point.  On training, WiFi visibility and the physical location of all landmarks are obtained as a reference landmark list.  Then, when a landmark is detected on testing, we compare the current WiFi visibility to all landmarks and take the physical location of the closest landmark in the reference list.

3. Experiments and Results
We experimented with the proposed method in a large, multi-floor office building with many stairs and elevators.  In our experiments, accelerometer, gyroscope and barometer data are recorded from Android smartphones at 50Hz.  Training data were recorded for a total of 10271 seconds performed by three subjects.  To help with annotation, the same action was performed repeatedly.  We defined 6 indoor activities of walking, taking stairs down, taking stairs up, standing still, taking elevator down and taking elevator up.  For test data, subjects walked inside a building naturally.  Test trajectories are composed of 12 sequences in total of 6160 seconds long.

a) Quantitative analysis
Activity classification results for various models are shown in the following table.  Columns show class accuracies for SVM, HMM and rectification results, respectively.  The HMM inference obtained from the Viterbi algorithm improves over the SVM classification for all activities.  Figure 4 shows that the HMM improves confusions on locomotive activities of walk, stair down and stair up.  It also improves misclassifications of the activities of stand still to elevator down and elevator up.  Such sporadic misclassifications were suppressed by temporal smoothing from the HMM.  Finally, post-processing with HMM inference further rectifies the walk activity which was misclassified as stairs.

b) Qualitative analysis
The following figure shows an example of the inference result.  The labels of activities are walking(WA), stair down(SD), stair up(SU), stand still(SS), elevator down(ED) and elevator up(EU) from bottom to top. In the given sequence, the user visited 7 floors including \(F0\) which is the basement.

We observe that SVM inference gives misclassification between locomotive activities of walking, stair down and stair up.  Those misclassifications are corrected by the HMM Viterbi algorithm.  The floor is correctly inferred by the Viterbi algorithm. Post-processing further corrects stairs down and stairs up activities that did not incur a floor change to walk.

c) Multi-floor GraphSLAM
This example describes a trajectory that included visits to 4 floors.

Initial trajectory obtained from smatphone PDR contains drift.

The following video shows how GraphSLAM improves trajectory using organic landmarks obtained from the proposed framework.

4. Conclusion
In this paper, we propose a novel framework that jointly infers activity and floor landmarks.  Pose invariant features from inertial sensors are adopted for SVM-based activity classification.  We design an HMM where an activity on each floor defines a state.  State transitions are designed to provide temporal smoothness of a state sequence.  The probability of an observation is estimated from an activity class probability provided by the SVM classification, which is multiplied by the floor likelihood from a mixture-of-Gaussians model.  We further rectify activity inferences when we observe activities of stairs and elevators which do not incur floor changes.  Our experiments show that the proposed framework accurately classifies activities and infers floors.  Finally, we showed that the organic landmarks obtained from our framework can be applied effectively to enable multi-floor GraphSLAM.


  • [1] T. Kobayashi, K. Hasida, and N. Otsu, “Rotation invariant feature extraction from 3D acceleration signals,” in ICASSP, 2011
  • [2] G. Grisetti, R. Kummerle, C. Stachniss, and W. Burgard, “A tutorial on graph-based SLAM,” in Intelligent Transportation Systems Magazine, IEEE, 2010.

Full text: [pdf]

Presentation slide: [pptx]


Relevance Prediction of Image Labels

1. Overview

Most traditional image annotation approaches focus on tagging of images with labels: textual labels from a lexicon of words are either assigned or not assigned to an image based on its visual content or the content of similar images.  However, in many situations it may be advantageous to predict relevance of individual labels and not just their presence or absence.   Instead of inferring this relevance as a side-product of tagging, we propose an approach that directly focuses on label relevance prediction.  In the proposed setting, each label is assigned to a finite set of ordered relevance levels, akin to classical elicitation of relevance in user studies.  To induce dependencies between relevance predictions on related labels we extend the recently proposed Conditional Ordinal Random Field model to this image relevance assignment task.  Experiments on LabelMe and PASCAL VOC 2007 demonstrate the utility of the proposed model and potential advantages of the new relevance setting over traditional tagging.

2. Introduction


Figure 1: Prediction of relevance for four image labels. The second column represents the output of a traditional binary classification method. The probabilities in the fourth column are the estimates of relevance produced by the proposed method. Our proposed method predicts relevance of a label to the query image as shown in the third column. The highest relevance tag can be selected to predict the final relevance.  Red bars indicate ground truth relevance.

In this work we propose to formulate the image annotation problem as that of assigning relevance levels to possible image labels.  This is illustrated in Figure 1, where the labels are assigned to one of the fours relevance levels: “not relevant”, “weakly relevant”, “relevant” and “highly relevant.”  Unlike the classification setting the relevance levels here are ordered:  “not relevant” < “weakly relevant” < “relevant” < “highly relevant.”  In other words,  “not relevant” is closer to “weakly relevant” than it is to  “highly relevant.”  Assignment of discrete levels in contrast to continuous relevance scores has several potential advantages. Assignment of discrete levels at training (human data annotation) time is feasible for human annotators and as easy as traditional absent/present label annotation. At prediction time (automated annotation) the use of relevance levels is appealing as it yields small sets of discrete and easy to understand rank categories, precluding the need to form those categories from continuos scales after the prediction stage.

We also propose a computational framework for learning and prediction of image label relevances based on recently introduced Conditional Ordinal Regression Field model (CORF) [1].  CORF directly generalizes traditional computational tools for image annotation, such as Conditional Random Fields (CRFs), to the relevance prediction setting while taking advantage of correlations between predictions of different labels, either based on language or image contexts.  CORF has shown excellent results in the context of dynamic ordinal regression and prediction of emotion expression levels in facial video sequences.  We show that CORF can be adapted to the problem of label relevance prediction in image annotation.  Our results on images from two datasets, LabelMe and PASCAL VOC 2007, demonstrate potential benefits of this approach.

3. Relevance Level Conditional Ordinal Regression Field (RL-CORF)

In this section we suggest how the independent-label ordinal regression models can be extended to structured-label settings. This can be accomplished by merging the ordinal regression model with the traditional conditional random field (CRF) model. We describe how this task can be accomplisehd specifically in the context of image relevance annotation.

A typical node potential function of CRF is replaced with the ordinal regression model:
\sum_{r \in V} \mathbf{v}^{\top} \Psi_r^{(V)}(\mathbf{x}, y_r) \rightarrow \Gamma_r^{(V)}(\mathbf{x}, y_r; \mathbf{w},\mathbf{b}, \sigma).
Then we have
\Gamma_r^{(V)}(\mathbf{x}, y_r) = \sum_{c=1}^R I(y_r = c)\log \Big( \Phi(\frac{b_c -f}{\sigma}) – \Phi(\frac{b_{c-1} – f}{\sigma})\Big),
where \(f = \mathbf{w}^T \cdot \phi(\mathbf{x})\) and \(\phi(\mathbf{x})\) is a node feature function which maps the input \(\mathbf{x}\) to feature vector.
Finally, we have the RL-CORF model:
p(\mathbf{y}|\mathbf{x}) \propto \exp\Big( \sum_{r \in V} \Gamma_r^{(V)}(\mathbf{x}, y_r)  + \sum_{e \in E} \mathbf{u}^T\Psi^{(E)}(\mathbf{x},  y_r, y_s) \big)

4. Experimental Results

We evaluate our model in two sets of experiments. First, we consider the binary classification to verify the effect of RL-CORF for the LabelMe data. In this experiment, we contrast the performance of our model with Support Vector Machine (SVM) model. Next, we empirically demonstrate the performance RL-CORF on predicting the relevance levels. Support Vector Ordinal Regression (SVOR) and Conditional Random Field (CRF) are used to set two baselines in this setting.

Table 1: Results of the Binary Classification for LabelMe

AUC-PR 0.42 0.45

Table 1 shows the results of the binary classification. As one can see, RL-CORF performs better than SVM. Modeling dependencies of labels using a graphical structure had lead to improved binary prediction (tagging) performance.

Table 2: Results of Relevance Level Prediction (VUS scores)

LabelMe 0.056 0.057 0.071
PASCAL VOC 2007 0.039 0.036 0.054

We use the volume under a 4-dimensional surface (VUS) as the evaluation measure for relevance prediction. The VUS is a generalization of the AUC and its underlying probabilistic interpretation to ordinal regression problems.

Table 2 shows the prediction results of the relevance levals for LabelMe and PASCAL VOC 2007. Higher VUS score values imply better performance. One can see that RL-CORF outperforms CRF and SVOR with at least 24% higher VUS values.

Figure 2, 3, 4 and 5 depict prediction results of arbitrarily chosen images for LabelMe and PASCAL VOC 2007. We take the labels for which SVOR, CRF, and RL-CORF produce different predictions. Relevance levels for all other labels not depicted are identical for all three models. The probabilities in the second, third and fourth columns are the relevance estimates produced by SVOR, CRF and RL-CORF, respectively. They correspond to levels “not relevant”, “weakly relevant”, “relevant” and “highly relevant” from left to right. Red bars indicate ground truth relevance.

CRF uses the same graphical structure as RL-CORFs. However, because CRF does not use the ordinal scale outputs but considers all levels equally different/same, it often fails to estimate accurate labels relevances. On the other hands, RL-CORF predicts the correct relevance levels for which SVOR produces inferior estimates, which can be attributed to the effective collaboration among labels through the tree structure in RL-CORF.

Figure 2: Prediction of relevance levels for an image in LabelMe. RL-CORF, CRF and SVOR predict the relevance level correctly for 8, 2, and 1 label, respectively.

Figure 3: Prediction of relevance levels for an image in LabelMe. RL-CORF, CRF and SVOR predict the relevance level correctly for 10, 2, and 7 labels, respectively.

Figure 4: Prediction of relevance levels for an image in the PASCAL VOC 2007. RL-CORF, CRF and SVOR predict correct relevance for 6, 0, and 4 labels, respectively.

Figure 5: Prediction of relevance levels for an image in the PASCAL VOC 2007. RL-CORF, CRF and SVOR predict correct relevance for 10, 1, and 8 labels, respectively.

5. Conclusion

We proposed a new task of assigning image labels to ordinal relevance categories, a natural extension of traditional tagging. Our computational formalism is based on the structured ordinal regression method which estimates the relevance levels effectively on the ordinal scales while exploiting possible dependencies among labels, conditioned on the image context. Our experiments show that the proposed model outperforms competing methods including SVM, SVOR and CRF both in binary classification and the prediction of label relevance levels.


  • [1] M. Kim and V. Pavlovic. “Structured output ordinal regression for dynamic facial emotion intensity prediction”. Computer Vision – ECCV 2010. Daniilidis, Kostas, Maragos, Petros, Paragios and Nikos eds. 2010. pp. 649-662.

Attribute Rating for Classification of Visual Objects

1. Overview

Experiments on Animals with Attributes dataset demonstrate the performance of the proposed method and show its advantages over previous methods based on binary tagging and multi-class classification.Object classes are then predicted using these ratings.In this work, we propose a new method where each label/attribute can be assigned to a finite set of ordered ratings, from most to least relevant.The ordinal scale representation allows us to describe object classes more precisely than simple binary tagging.However, it is sometimes useful to predict the ratings of the labels or attributes endowed with an ordinal scale (e.g., “very important,” “important” or “not important”).Traditional visual classification approaches focus on predicting absence/presence of labels or attributes for images. 

2. Introduction

Traditional binary tagging (presence/absence) of attributes makes the two classes indistinguishable. On the other hand, rating each attribute according to four ratings may lead to disambiguation based on different rating depiction of each object class. For instance, as illustrated  in Figure 1, attribute “spot” is irrelevant to describing a polar bear. On the other hand, “coastal” is highly relevant because it is a habitat of the polar bear. Similarly, both “blue” and “swim” can be deemed less relevant and relevant attributes, respectively, to represent the polar bear. Ratings of labels can be particularly useful when dealing with object attributes. Attributes, such as color, shape, or lightness, are useful for succinct and intuitive characterization of objects.

In this work, we propose a method to formulate the multi-class classification problem as that of assigning the ordinal ratings in terms of relevance attributes, in contrast to other works that consider the binary tagging. We use a probabilistic ordinal regression to infer the ordinal ratings for the attributes. Experiments on recognizing animals classes on {\it Animals with attributes} (AwA) dataset [1] show potential benefits of the proposed. 

Figure 1: Binary tagging v.s. ordinal ratings for attributes. Two classes are indistinguishable in traditional binary tagging. On the other hand, rating each attribute according to four ratings makes them distinguish from each other.

3. Direct Attribute Prediction (DAP)

In this section we briefly present the direct attribute prediction (DAP) method proposed in [1], generalized to R attribute ratings. A probability of a class \(z\) for a given input \(\mathbf{x}\) is defined as:
\[ P(z|\mathbf{x}) = P(z) \prod_{m=1}^M \frac{P(a_m^z|\mathbf{x})}{P(a_m^z)}, \qquad \qquad (1)\]
where \(P(z)\) is the class prior, \(P(a_m^z)\) is the attribute prior and \(P_m(a_m^z|\mathbf{x})\) is the image-attribute probability that we learn during training. Please refere to [1] for details.

4.  Probabilistic Model for Ordinal Regression

4.1. Oridinal Regression Model

In an ideal, noise-free setting, an ordinal regression strategy can be interpreted in the following probabilistic setting
\[ P_{ideal} (y=c|\mathbf{x}) = \left\{\begin{array}{ll} 1 & \textrm{if } g(\mathbf{x}) \in (b_{c-1}, b_c]\\ 0 & \textrm{otherwise}\end{array} \right. ,\]
where \(g(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})\), \(\phi(\mathbf{x})\) is the feature function which projects the image \(\mathbf{x}\) into the feature space and \(-\infty = b_0 \leq b_1 \leq b_2 \leq  \cdots \leq b_R = \infty\). When we add the Gaussian noise \(\delta \sim N(\delta; 0, \sigma^2)\), the posterior probability becomes
P(y=c|\mathbf{x}) &= \int_{\delta} P_{ideal}(y=c|g(\mathbf{x})+\delta) \cdot
N(\delta;0,\sigma^2)d\delta \nonumber \\
& = \Phi\left(\frac{b_c -g(\mathbf{x})}{\sigma}\right) – \Phi\left(\frac{b_{c-1} – g(\mathbf{x})}{\sigma}\right),
where \(\Phi(\cdot)\) is the standard normal cdf. We call this method normal cumulative density function scaling (NCDFS).

4.2. Normal Cumulative Density Function Scaling (NCDFS)

To find the optimal parameters of NCDFS for our purpose, we want to minimize the following log-likelihood function
\mathcal{L} &= -\sum_{i=1}^N \sum_{m=1}^M \log P(a_m^{y_i}|\mathbf{x}) =
-\sum_{i=1}^N \sum_{m=1}^M\sum_{c=1}^{R} I(a_m^{y_i} = c) \nonumber \\
& \quad \cdot \log \left( \Phi\left(\frac{b_c -g_m(\mathbf{x}_i)}{\sigma}\right)
– \Phi \left(\frac{b_{c-1} – g_m(\mathbf{x}_i)}{\sigma}\right)\right),
where \(N\) is the number of images and \(y_i\) is the class label of the input \(\mathbf{x}_i\).

5. Data Set

We examine the performance of the proposed method on Animals with Attributes dataset (AwA) [1]. AwA contains 30,475 images, 50 animal classes and 85 attributes. We split the images into training and test data as described in [1]. The test data contains 10 classes: ‘chimpanzee’, ‘giant panda’, ‘hippopotamus’, ‘humpback whale’, ‘leopard’, ‘pig’, ‘raccoon’, ‘rat’, ‘persian cat’ and ‘seal’, consisting of 6,180 images. The training data contains 24,295 images of the other 40 classes.

We use 6 visual features provided from [1] : RGB color histograms, SIFT, rgSIFT, PHOG, SURF and local self-similarity historams. An image is represented by concatenating the 6 visual features, a 10,940 dimensional vector in total. 

We map the attributes to four ratings, “irrelevant” < “less relevant” < “relevant” < “highly relevant.” These ratings were chosen because they shows the best performance. We will investigate how the number of rating values impacts prediction accuracy.

6. Results

6.1 NCDFS settings

For the feature function, we set \(\phi(\mathbf{x}) = [1, f(\mathbf{x})]^\top \), where \(f(\mathbf{x})\) is the output of Support Vector Ordinal Regressor (SVOR) for \(\mathbf{x}\). Using SVOR can benefit from computational advantages because it is faster to use the output of SVOR than to use the raw visual features.

We first predict the probability \(P_m(r|x)\) for the mth attribute given the input image \(\mathbf{x}\)  using NCDFS. Then we estimate a class of the input \(\mathbf{x}\) which maximizes Equation (1) with the attribute-to-class mapping (attribute table).

To set the baseline performance, we use a multi-class SVM (mSVM). mSVM takes the same input as SVOR. mSVM, however, considers that all levels (ratings) are treated equivalently in contrast to SVOR where the levels are ordered.

6.2 Results

Table 1: Average accuracy for detecting animal classes for [1], mSVM and NCDFS. The multi-class accuracy is measured by the mean of the diagonal of the confusion matrix. One can find that NCDFS outperforms the other methods.

Table 2: Area Under Curve (AUC) of 10 test classes for [1], mSVM and NCDFS (%).

Figure 2: Confusion matrices between 10 test classes of the AwA dataset for [1], mSVM and NCDFS.

Figure 3: Classification accuracies for various number of ratings. Figure 3 verifies that the proposed 4 rating scale would be the optimal for the AwA classification problem based on the ordinal regression.


  • [1] C. H. Lampert, H. Nickisch, and S. Harmeling. “Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer”. In CVPR, 2009


  • Attribute Rating for Classification of Visual Objects, Jongpil Kim and VladimirPavlovic, International Conference on Pattern Recognition (ICPR), 2012. [pdf]

Discovering Characteristic Landmarks on Ancient Coins

1. Goal of the Project
For a given Roman coin image, the goal is to

  1. to automatically find visual characteristics of the coin which make it distinguishable from the others
  2. to identify Roman Imperial Coinage (RIC) label of the coin

Figure 1. Sample obverse and reverse images of two an- cient Roman imperial coins. Both coins depict the same emperor (Domitian) on the obverse side but have distinct reverse depictions, resulting in different Roman Imperial Coinage (RIC) labels. The descriptions for them are (a) obverse: Laureate head right, Reverse: Minerva standing right on capital of rostral column with spear and shield to right owl, and (b) obverse: Laureate head right, Reverse: Pegasus right.

2. Motivation and Practical Applications

  • Understanding the ancient Roman coins could serve as references to understand the Roman empire
  • A reliable and automatic method to recognize the coins is necessary as the coin market is very active and many people are collecting coins as hobby
  • Ancient coins are becoming subject to a very large illicit trade. Recognition of the ancient Roman coins is not easy for novices but requires knowledge.

3. Coin Classification Using Convolutional Neural Networks (CNNs)

  • Fine-tuning on pre-trained CNN models

    – Considering the number of the coin images in our dataset (about 4500), the CNN model is likely to be under-fitted if we train it only on the coin dataset even if we use the data augmentation method
    – We adopt one of the most popular architecture proposed by Krizhevsky et al. [2] which is pre-trained on the ImagetNet with millions of natural images

  • Hierarchical classification

    – One Roman emperor includes several RIC labels as shown in Figure 1 while one RIC label belongs to exactly one emperor. Therefore, we can build a tree structure to represent the relationship between the Roman emperors and the RIC labels as depicted in Figure 2
    – Then the final probability is defined to be the product of the probabilities on the path from the root to the leaf as \[p(e|I_o) \cdot p(r|I_r) \cdot \delta(Pa(r) = e)\] where \(Pa(r)\) is the parent of node \(r\) and \(\delta(\cdot)\) is the indicator function

    Figure 2. Hierarchical classification for the RIC label

4. Finding Characteristic Landmarks on Roman Coins

  • Characteristic region set: the smallest set of local patches to represent the identity of the full image
  • Representation of the masked image
  • \[ f_\mathbf{I}(\mathbf{x}) = \left(\sum_{k=1}^{K} x_k \cdot \mathbf{I}_k\right) \otimes \mathbf{C} \] where \(\mathbf{I}\) is an image, \(\mathbf{I}_k\) the k-th subregion of image \(\mathbf{I}\), \(x_k \,\,(0 \leq x_k \leq 1)\) a transparency weight for the k-th subregion (0: transparent, 1: full intensity), \(\mathbf{C}\) a normalization vector

  • Objective function: finding an image which consists of the smallest set of regions but still can be correctly classified
  • \[ \begin{align*} \min_\mathbf{x} & \quad \ell_c\left(f_\mathbf{I} (\mathbf{x})\right) + \lambda \mathcal{R}(\mathbf{x}) \\ \mbox{s.t.} & \quad p\left(c|f_\mathbf{I}(\mathbf{x})\right) > p\left(c|f_\mathbf{I}(\mathbf{1})\right) – \epsilon,\\ & \quad \epsilon > 0, \nonumber \end{align*} \] where \(\ell_c(\cdot)\) is the loss function of the CNN model, \(\mathcal{R}(\mathbf{x})\) a regularization function and \(\lambda\) is a hyper parameter to control the regularization, and we place the constraint so that the prediction probability of the masked image \(f_\mathbf{I}(\mathbf{x})\) may differ from the original image \(f_\mathbf{I}(\mathbf{1})\) at most \(\epsilon\)

5. Coin Data Set

  • Coin images are collected from a numismatic website
  • 4526 Roman Imperial coins with RIC 314 labels and 96 Roman emperors
  • – Annotated for visual analysis (the original dataset only has numismatic annotation)
    – Both obverse and reverse images for each coin
    – Each emperor has at least 10 coins

  • High resolution images : at least 300-by-300 pixels

6. Experimental Results

  • Classification
    – Baseline method: SVM

    Table 1. Classification Accuracies for SVM and CNN

    Figure 3. Confusion matrices of CNN and SVM

  • Finding Landmarks
  • Figure 4. Visualization of landmarks as a function of \(\epsilon\)

    Figure 5. Discovered landmarks for obverse and reverse images

7. Conclusion

In this project, we proposed a novel method to discover the characteristic landmarks of the ancient Roman imperial coins. Our method automatically finds the smallest set of the discriminative regions sufficient to represent the identity of the full image and distinguish it from other available classes.

The qualitative analysis on the visualization of the discovered regions confirm that the proposed method is able to effectively find the class-specific regions but also it is consistent with the human expert annotations. The proposed framework to identify the ancient Roman imperial coins outperforms the previous approach in the domain of the coin classification by using the hierarchical structure of the RIC labels.


  • [1] J. Kim and V. Pavlovic. “Discovering Characteristic Landmarks on Ancient Coins using Convolutional Networks”. 2015. Available:
  • [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”. NIPS. 2012

Sequence Alignment

The problem of sequence alignment arises in many fields of science as a consequence of dealing with data that does not live in fixed dimensional Euclidean spaces. In computer vision, sequence alignment is an important first step used to solve problems such as the human activity analysis and recognition. The alignment can be used to establish a measure of similarity between two sequences of video frames or motion capture data, which can be subsequently employed for sequence classification or clustering.

A traditional way to address the alignment problem between two sequences relies on Dynamic Time Warping (DTW). DTW is typically solved using a combinatorial dynamic programming algorithm that searches for a globally optimal warping path, mapping the domain of one sequence onto the other. DTW and its derivates have shown great success in many practical alignment applications [3]. In practice the unconstrained warping of a generic DTW often fails to yield reasonable and robust results. Imposing constraints on the feasible warping paths has empirically shown to improve the classification performance. For instance, the Sakoe-Chiba band constraint, which restricts the maximum deviation of matching slices from the diagonal by \( p \)% of the sequence length, can result in substantially improved alignments depending on the choice of \( p \). Recently, [6] proposed an adaptive band approach that estimates function spaces of time warping paths, removing the need for a fixed \(p\). In this setting motion class-specific warping-path constraints are learned for each class that reflect the warping variations of samples within it. Nevertheless, imposing a proper set of constraints on DTW is still a challenging problem.

One additional property of DTW-based alignment is that it assumes pairing of individual data points: a sample at time \(t_{x}\) in sequence \(x\) is typically aligned with only one other sample at time \(t_{y}\) in sequence \(y\). In many practical applications it may be more desirable to establish pairing between groups of points: associating a temporal segment \( \mathbf{t}_{x} = [t_{x,1}, … , t_{x,2}] \) to another segment of the contrasting sequence, \( \mathbf{t}_{y} = [t_{y,1},…,t_{y,2}] \). The segment pairing can reflect alignment of e.g., segment sufficient statistics instead of the raw sample values, making the comparison more robust to individual sample differences. For example, in mocap data segment-level alignment can be less susceptible to individual subject differences while performing a certain motion or activity. Alignment on the segment level can also be justified as a way of comparing the local segment densities, in view of the Hilbert space embeddings of probability distributions. Additionally, point-to-point pairing deems DTW to be very sensitive to noise. While pairing convex combinations of the segments of the two contrasting sequences can impose additional filtering on data leading to a more robust alignment.

A natural way of solving the alignment problem is to find the aligned subspace (manifold in general) where the correlation of two sequences is maximized. Canonical correlation analysis (CCA) provides the necessary means for finding such a subspace between a pair of random variables. However, CCA in its original formulation does not respect the critical monotonicity property of any temporal alignment, which prevents arbitrary permutations of time indexes (e.g., self intersection of time, mapping \( t_{x,1} \rightarrow t_{y,2} \) and \( t_{x,2} \rightarrow t_{y,1} \) when \( t_{x,1} < t_{x,2}, t_{y,1} < t_{y,2} \)). To address this problem [8] proposed to formulate CCA of two vectors as a regression problem with a proper hyperbolic basis for the coefficient vector, guaranteeing monotonicity and the isotonic character of this solution. Their approach, however, does not immediately extend to multivariate time series and also does not explicitly model the segment alignment.

Given the above discussion, our approach was based on an isotonic extension to CCA to tackle the problem of sequence alignment. Unlike the traditional CCA, we introduced alternative constraints that guarantee monotonicity in the projected alignment space. We showed that these constraints are of quadratic nature, similar to the traditional CCA normalization constraints. This set of constraints is supplemented with non-negativity and norm-1 normalization constraints, which together allow alignment of subsegments and the Hilbert density embedding interpretation. We then presented an efficient solution to the isotonic CCA optimization based on iterative coordinate descent and non-negative least squares. The details and the results of applying IsoCCA is presented in our paper published in proceeding of International Conference on Computer Vision, 2011, Barcelona, Spain.

The code for IsoCCA is available at the software section.


Related Publications

  • [1] S. Shariat and V. Pavlovic. “Isotonic CCA for Sequence Alignment and Activity Recognition”. Internation Conference on Computer Vision. 2011.


Structured Learning for Multiple Object Tracking

Adaptive tracking-by-detection methods use previous tracking results to generate a new training set for object appearance, and update the current model to predict the object location in subsequent frames. Such approaches are typically bootstarpped by manual or semi-automatic initialization in the first several frames. However, most adaptive tracking-by-detection methods focus on tracking of a single object or multiple unrelated objects. Although one can trivially engage several single object trackers to track multiple objects, such solution is frequently suboptimal because it does not utilize the inter-object constraints or the obejct layout information [2].

We propose an adaptive tracking-by-detection method for multiple objects, inspired by recent work in [1] and [2]. The constraints for structured Support Vector Machine (SVM) in [1] are modified to localize multiple objects simultaneously with both appearance and layout information. Moreover, additional binary constraints are introduced to detect the existences of respective objects and to prevent possible model drift. Thus the method can handle frequent occlusion in multiple object tracking, as well as objects entering or leaving the scene. Those binary constraints make the optimization problem significantly different from the original Structured SVM [3]. The inter-object constraints, embedded in a linear programming technique similar to [2] for optimal position assignment, are applied to diminish false detections.

In single object tracking case, given a set of frames \(\{ x_1, x_2, \ldots, x_n \}\) indexed by time, and the corresponding set of labeling, i.e. bounding box, \(\{ \mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n \}\), structured SVM tries to find a model \(f(x, \mathbf{y})\), such that the task of predicting object location in a testing frame \(x\) could be conquered by maximizing:
f(x, \mathbf{y}) = \langle \mathbf{w}, \Psi(x|_{\mathbf{y}}) \rangle, \tag{1}
where \(x|_{\mathbf{y}}\) is the patch of frame \(x\) within bounding box \(\mathbf{y}\) or the features extract from it, and \(\Psi(\cdot)\) is the mapping from input space to the implicit features space. The resulted optimization problem is the following,
\min_{\mathbf{w}, \mathbf{\xi}} \frac{1}{2} \| \mathbf{w} \|^{2} + C_1 \sum_{i=1}^n \xi_i \quad \mathrm{s.t.} \tag{2} \\
\langle \mathbf{w}, \Psi(x_i|_{\mathbf{y}_i}) – \Psi(x_i|_{\mathbf{y}}) \rangle \geq \Delta(\mathbf{y}_i, \mathbf{y}) – \xi_i, \quad i = 1, 2, \ldots, n, \quad \mathbf{y} \neq \mathbf{y}_i, \tag{3}
where \(\xi_i \geq 0 \), \(\mathbf{y} \neq \mathbf{y}_i\) implies bounding box \(\mathbf{y}\) in (3) could be anywhere else other than groundtruth \(\mathbf{y}\), and \(\Delta(\mathbf{y}_i, \mathbf{y}) = 1 -\frac{\mathbf{y}_i \cap \mathbf{y}}{\mathbf{y}_i \cup \mathbf{y}}\) is the loss of predicting \(\mathbf{y}\) when groundtruth is \(\mathbf{y}_i\).

The tracker could track slowly changing object due to its adaptive nature, but it is also likely to drift when the object is occluded or out of the scene. For selective adaptation and suppression of drifting, binary constraints are added. Suppose \(Z\) is the training set of the object detector, and each \(\mathbf{z} \in Z\) has the label \(l_{\mathbf{z}} \in \{ +1, -1 \}\). For each \(\mathbf{z} \in Z\), the binary constraint is
l_{\mathbf{z}}(\langle \mathbf{w}, \Psi(\mathbf{z}) \rangle + b) \geq 1 – \eta_{\mathbf{z}}, \quad \forall \mathbf{z} \in Z, \tag{4}
where \(b\) is the bias and \(\eta_{\mathbf{z}} \geq 0\). (4) favors the sample \(\mathbf{z}\) which is correctly classified by current model. The overall objective function (2) becomes
\min_{\mathbf{w}, b, \mathbf{\xi}, \mathbf{\eta}} \frac{1}{2} \| \mathbf{w} \|^{2} + C_1 \sum_{i=1}^n \xi_i + C_2 \sum_{\mathbf{z} \in Z} \eta_{\mathbf{z}}. \tag{5}
Objective function (5), constraints (3)(4) and slack variable constraints lead to a new optimization problem, which could be recognized as a combination of structured SVM and binary SVM.

The following videos are the tracking results with and without binary constraint (4), respectively. Without the constraint, the tracker still tries to track the male’s face even when it is occluded by the female in the second frame. Then the ongoing adaptation quickly adapts the tracker to the female’s face, leading to model drift. In contrast, with the constraint tracker knows its target, i.e., the male face is not presented in the second frame, and does not output the bounding box nor update the model.

With Constraint
Without Constraint

In multiple object case, compared with the single object version, we add constraint that two or more objects can not appear in the same location in one frame, as well as the objects layout information. The training set includes a frame set \(\{ x_1, x_2, \ldots, x_n \}\) indexed by time, and \(\{ \mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_n \}\) is the correspond set of structured labels, where \(\mathbf{Y}_i = (\mathbf{y}_i^{(1)}, \mathbf{y}_i^{(2)}, \ldots, \mathbf{y}_i^{(K)})\) indicates the bounding boxes corresponding to \(K\) objects in frame \(i\). If the \(k\)-th object does not appear in the \(i\)-th frame, \(\mathbf{y}_i^{(k)}=null\). We design a function \(f(x, \mathbf{Y})\) such that the object locations \(\mathbf{Y}^*\) in frame \(x\) are given by maximizing
f(x, \mathbf{Y}) = \sum_{k = 1}^K \langle \mathbf{w}^{(k)}, \Psi(x|_{\mathbf{y}^{(k)}}) \rangle + \langle \mathbf{v}, \Phi(\mathbf{Y}; \mathbf{Y}_{i – 1}) \rangle, \tag{6}
where \(\mathbf{Y}_{i-1}\) is the layout in \(i-1\)-th frame and \(\Phi(\mathbf{Y}; \mathbf{Y}_{i – 1})\) is the layout feature of size \(\binom{K}{2} \times 2\), whose \(k\)-\(l\)-\(j\)-th element is
\Phi_{klj}(\mathbf{Y}; \mathbf{Y}_{i – 1}) = \left\{
\left| \left( \mathbf{y}_{i – 1}^{(k)}(j) – \mathbf{y}_{i – 1}^{(l)}(j) \right) – \left( \mathbf{y}^{(k)}(j) – \mathbf{y}^{(l)}(j) \right)\right| & \textrm{if $\mathbf{y}_{i – 1|i}^{(k|l)} \neq null$}\\
0 & \textrm{otherwise}
\end{array} \right., \tag{7}
while \(\mathbf{y}(1)\) and \(\mathbf{y}(2)\) are the horizontal and vertical coordinates of the bounding box \(\mathbf{y}\)’s center, respectively. The model leads the following optimization.
\min_{\mathbf{w}, \mathbf{v}, \mathbf{\xi}, \mathbf{\eta}} \frac{1}{2}(\sum_{k = 1}^K \| \mathbf{w}^{(k)} \|^{2} + \| \mathbf{v} \|^2) + C_1 \sum_{i=2}^n \xi_i + C_2 \sum_{k = 1}^K \sum_{\mathbf{z} \in Z} \eta_{\mathbf{z}} \quad \mathrm{s.t.} \tag{8} \\
\sum_{k = 1}^K \langle \mathbf{w}^{(k)}, \Psi(x_i|_{\mathbf{y}_i^{(k)}}) – \Psi(x_i|_{\mathbf{y}^{(k)}}) \rangle + \langle \mathbf{v}, \Phi(\mathbf{Y}_i; \mathbf{Y}_{i – 1}) – \Phi(\mathbf{Y}; \mathbf{Y}_{i – 1}) \rangle \geq \Delta^M(\mathbf{Y}_i, \mathbf{Y}) – \xi_i,& \\
\quad \forall i, \quad \mathbf{Y} \neq \mathbf{Y}_i,&
\end{split} \tag{9}\\
l_{\mathbf{z}^{(k)}}(\langle \mathbf{w}^{(k)}, \Psi(\mathbf{z}^{(k)}) \rangle + b^{(k)}) \geq 1 – \eta_{\mathbf{z}^{(k)}}, \quad \forall k, \quad \forall \mathbf{z}^{(k)} \in Z^{(k)}, \tag{10}
(9) is the structured constraint, where \(\mathbf{Y}_i\) is the groundtruth object location set of frame \(i\), \(\mathbf{Y}\) is the set of locations other than groundtruth, and \(\Delta^M(\mathbf{Y}_i, \mathbf{Y}) = \sum_{k = 1}^K \Delta(\mathbf{y}_i^{(k)}, \mathbf{y}^{(k)})\) is a combination of losses on each objects. (10) is the binary constraint.

Following figures shows the tracking results on 2 different video clips, ‘motinas-multi-face-fast’ and ‘toys’ respectively. The videos can be found here.

In most of the cases, the proposed method significantly outperforms other adaptive single object methods, which quickly get adapted to other wrong image patches. The only exception is the Struck result of the candy bag, since the bag has never been occluded. For the face video, the non-adaptive multiple object tracking method (Huang) works fine since a good face detector is available. However, the same method works poorly on the second video because neither enough training samples nor trained detectors are available. In contrast, the proposed method works equally well on both videos due to its adaptive nature.


  • [1] S. Hare, A. Saffari and P. H. Torr. “Struck: Structured output tracking with kernels”. IEEE International Conference on Computer Vision. 2011.
  • [2] H. Jiang, F. Fels and J. Little. “A linear programming approach for multiple object tracking”. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2007.
  • [3] I. Tsochantaridis, T. Hofmann, T. Joachims and Y. Altun. “Support vector machine learning for interdependent and structured output spaces”. International Conference on Machine Learning. 2004.


Multiple Object Tracking and Recognition

Multiple object tracking has numerous applications in video surveillance, human behavior analysis, visual navigation and sports video analysis. In contrast to applying independent individual trackers, the multi-object tracker handles all of the objects simultaneously in order to exploit the cross-object clues, e.g. occlusion constraints, layout constraints, motion constraints, etc. As a consequence multi-object trackers usually display better tracking performance than single object trackers.

However, most multi-object trackers to date do not utilize the identity information of the tracked objects. We argue that the joint tracking and recognition are beneficial due to their intrinsic temporal interdependencies. With all objects distinct, such as different human faces, the interdependencies become particularly appealing. For the tracker, the objects recognized as the same identity will have higher probability of belonging to the same track, while the probability for the objects with different identities will be lower. Thus the identity information could guide the adjustment of cross-object similarities, which will be used in tracking. For the recognition side, objects of one track are more likely to have the same identity. Therefore, the joint tracking and recognition alleviates the difficulty of recognition due to the insufficient visual information from one single frame. Besides the interdependencies between tracking and recognition, one could also utilize pairing constraint, i.e. one identity cannot be assigned to multiple objects within one video frame. The above mentioned relationships are illustrated in the following figure.

In this project, we are building a joint tracking and recognition system, in which the above mentioned interdependencies between the two tasks and pairing constraint could be naturally integrated. Preliminary experiments on human faces show its superior accuracy compared to independent tracking and recognition.


  • [1] A. Cohen and V. Pavlovic. “An Efficient IP Approach to Constrained Multiple Face Tracking and Recognition”. IEEE International Workshop on Socially Intelligent Surveillance and Monitoring. 2011.


Conditional State-Space Models (CSSM) [1]

We consider the problem of predicting a sequence of real-valued multivariate states from a given measurement sequence. Its typical application in computer vision is the task of motion estimation. State-Space Models are widely used generative probabilistic models for the problem. Instead of jointly modeling states and measurements, we propose a novel discriminative undirected graphical model which conditions the states on the measurements while exploiting the sequential structure  of the problem. The major benefits of this approach are: (1) It focuses on the ultimate prediction task while avoiding probably unnecessary effort in modeling the measurement density, (2) It relaxes generative models’ assumption that the measurements are independent given the states, and (3) The proposed inference algorithm takes linear time in the measurement dimension as opposed to the cubic time for Kalman filtering, which allows us to incorporate large numbers of measurement features. We show that the parameter learning can be cast as an instance of convex optimization. We also provide efficient convex optimization methods based on theorems from linear algebra. The performance of the proposed model is evaluated on both synthetic data and the human body pose estimation from silhouette videos.



  • [1] M. Kim and V. Pavlovic. “Conditional State Space Models for Discriminative Motion Estimation”. IEEE Int’l Conf. Computer Vision. 2007.
  • M. Kim, V. Pavlovic. Discriminative State Space Models – Snowbird Learning Workshop, 2007.