Spatio-Temporal Context Modeling for BoW-Based Video Classification

1. Abstract
We propose an autocorrelation Cox process that extends the traditional bag-of-words representation to model the spatio-temporal context within a video sequence. Bag-of-words models are effective tools for representing a video by a histogram of visual words that describe local appearance and motion. A major limitation of this model is its inability to encode the spatio-temporal structure of visual words pertaining to the context of the video. Several works have proposed to remedy this by learning the pairwise correlations between words. However, pairwise analysis leads to a quadratic increase in the number of features, making the models prone to overfitting and challenging to learn from data. The proposed autocorrelation Cox process model encodes, in a compact way, the contextual information within a video sequence, leading to improved classification performance. Spatio-temporal autocorrelations of visual words estimated from the Cox process are coupled with information gain feature selection to discern the structure essential for the classification task. Experiments on crowd activity and traffic density datasets illustrate that the proposed model achieves state-of-the-art performance while providing intuitive spatio-temporal descriptors of the video context.

2. Spatio-temporal context model

  • Univariate Cox process

A Cox process \(X\) is a point process defined on a locally finite subset \(S \subset \mathbb{R}^2\), whose intensity \(\Lambda\) is itself a stochastic process. If the intensity \(\Lambda\) is spatially constant, the Cox process reduces to a homogeneous Poisson process. [1] proposed the Log-Gaussian Cox Process (LGCP) to model spatial point processes. The intensity process \(\Lambda\) of the LGCP follows a log-Gaussian process:

\Lambda &= \{\Lambda(s) : s \in \mathbb{R}^2\}, \\
\Lambda(s) &= \exp\{ Y(s)\}, \\
Y &\sim \mathcal{N}(\mu,\sigma^2)
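As a concrete illustration, the LGCP intensity can be simulated on a discretized window. The sketch below is a simplification: it draws \(Y\) i.i.d. per grid cell rather than from a spatially correlated Gaussian process, and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretize the unit square into a G x G grid and simulate an LGCP by
# (1) drawing a Gaussian field Y (i.i.d. per cell here -- a simplification;
#     the full LGCP uses a spatially correlated Gaussian process),
# (2) exponentiating to obtain the intensity Lambda(s) = exp(Y(s)),
# (3) drawing a Poisson count per cell with mean Lambda * cell_area.
mu, sigma, G = 3.0, 0.5, 32          # illustrative parameters
cell_area = (1.0 / G) ** 2
Y = rng.normal(mu, sigma, size=(G, G))
Lam = np.exp(Y)                      # intensity is positive by construction
counts = rng.poisson(Lam * cell_area)
```

Exponentiating the Gaussian field guarantees a valid (non-negative) intensity, which is what makes the log-Gaussian construction convenient.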

Summary statistics of the log-Gaussian Cox process \(X\) with intensity \(\Lambda\) are given by its first- and second-order moments. [1] suggest an efficient non-parametric estimator of the mean intensity \(\rho\) and the pair correlation function \(g(r)\) for a univariate Cox process \(X\):

\hat{\rho} &= n/A(S),\\
\hat{g}(r) &= \frac{1}{2\pi r \hat{\rho}^2 A(S)}\sum_i \sum_{j \ne i} k_h(r - \|x_i - x_j\|_2)\, b_{ij},\\
k_h(a) &= \frac{3}{4h}\left(1 - a^2/h^2\right)\delta(a),\\
\delta(a) &=
\begin{cases}
1 & \text{if } -h \le a \le h\\
0 & \text{otherwise}
\end{cases}
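A minimal sketch of this estimator in Python, assuming points in a planar window; the edge-correction weights \(b_{ij}\) are set to 1 for simplicity, which is an assumption of this sketch rather than part of the estimator in [1]:

```python
import numpy as np

def cox_correlation(points, area, radii, h):
    """Non-parametric estimate of the mean intensity rho and the pair
    correlation g(r) of a univariate Cox process, using the Epanechnikov
    kernel k_h. Edge-correction weights b_ij are taken as 1 (assumption).
    points: (n, 2) locations; area: A(S); radii: 1-D array; h: kernel width."""
    n = len(points)
    rho = n / area                                   # mean intensity estimate
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))              # pairwise distances
    mask = ~np.eye(n, dtype=bool)                    # exclude i == j terms
    g = np.empty(len(radii))
    for k, r in enumerate(radii):
        a = r - dist[mask]
        # Epanechnikov kernel: (3/4h)(1 - a^2/h^2) on [-h, h], zero outside
        kern = np.where(np.abs(a) <= h, 0.75 / h * (1 - (a / h) ** 2), 0.0)
        g[k] = kern.sum() / (2 * np.pi * r * rho ** 2 * area)
    return rho, g
```

For a completely random (homogeneous Poisson) pattern, \(\hat{g}(r)\) should hover around 1; clustering pushes it above 1 at short radii.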

The following figure illustrates the Cox correlation function in 1D space: (a)-(c) kernel weights for the corresponding radii; (d) the Cox process correlation function over different radii.

  • Autocorrelation Cox process

Each video \(n \in N\) is represented by \(K\) visual word point processes. A point \({\bf x}\) consists of \((l,v,h,t)\): the visual word label \(l\), vertical location \(v\), horizontal location \(h\), and frame number \(t\). For each point process \(X_k\), the univariate Cox correlation \(\hat{g}_k\) is estimated as described above; \(\hat{g}_k\) represents the spatio-temporal distribution of \(X_k\). Our goal is to learn the common structure of each visual word within a video class.

The correlation estimates of the univariate Cox processes for all visual words are taken as the input features of each video. In order to infer the autocorrelation structure relevant for classification, we adopt the information gain feature selection principle. From an initial pool of features \(D\) with \(C\) classes, the information gain of each feature \({\bf d}_f\) is calculated:

IG(D, {\bf d}_f) &= H(D) - \sum_{i=1}^{V} \frac{|{\bf d}_f = i|}{|D|} H({\bf d}_f = i), \\
H(D) &= -\sum_{j=1}^{C} p_j(D) \log p_j(D)
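The information gain computation above can be sketched as follows for a single discretized feature column; variable names are illustrative:

```python
import numpy as np

def entropy(labels):
    """Empirical entropy H(D) over class labels (natural log)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def information_gain(feature, labels):
    """IG(D, d_f) = H(D) - sum_i |d_f = i|/|D| * H(D given d_f = i),
    for a discretized feature column taking values i = 1..V."""
    ig = entropy(labels)
    for v in np.unique(feature):
        idx = feature == v
        ig -= idx.mean() * entropy(labels[idx])  # weighted conditional entropy
    return ig
```

A feature that perfectly separates two balanced classes attains the maximum gain \(H(D) = \log 2\); an uninformative feature attains gain 0.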

Features with information gain higher than the threshold are selected as input features to an SVM classifier. Algorithm 1 summarizes the AutoCox modeling approach for video classification.
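The selection step amounts to thresholding the per-feature gains; a minimal sketch (the downstream SVM training is omitted, and the names are illustrative):

```python
import numpy as np

def select_features(X, gains, threshold):
    """Keep only feature columns whose information gain exceeds the
    threshold; the reduced matrix is then fed to an SVM (not shown here).
    X: (n_videos, n_features); gains: (n_features,) information gain values."""
    keep = gains > threshold                 # boolean mask over feature columns
    return X[:, keep], keep
```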

The following figure shows examples of the AutoCox using isotropic and anisotropic kernels: (a),(b) the point process from a top view; (c),(d) the correlation under each kernel.

3. Experiments

For all datasets, local motion and visual appearance features are extracted using the Dense Trajectories Video Descriptor [2]. Once the descriptors are extracted, K-means clustering is applied to each channel to construct the visual word sets, and each descriptor is assigned to its closest center. Each descriptor in a video therefore has four independent labels, one per feature channel. In the next phase, the autocorrelation of each visual word is estimated using the Cox process. The correlation function can be estimated with an isotropic kernel in 3D or with an anisotropic kernel with independent space and time profiles. The radii settings and the kernel width are chosen empirically. Finally, the discriminative autocorrelation elements are selected using information gain feature selection, and an SVM model is learned for classification.
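The vector-quantization step, assigning each descriptor to its closest K-means center, can be sketched as below; this is a simplified per-channel illustration, not the full channel-wise pipeline of [2]:

```python
import numpy as np

def assign_visual_words(descriptors, centers):
    """Assign each descriptor to its nearest K-means center (visual word).
    descriptors: (N, d) array; centers: (K, d) array of cluster centers.
    Returns an (N,) array of visual word labels in 0..K-1."""
    # squared Euclidean distance between every descriptor and every center
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```

Running this once per feature channel yields the four independent word labels per descriptor described above.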

  • Abnormal group activity detection

The UMN dataset [3] consists of group activities in two states, normal and abnormal. The videos are recorded in three different scenes, one indoor and two outdoor, with a total of eleven sequences that start in a normal state of natural walking and end in an abnormal "disperse" motion of the group of people.

The information gain threshold is set to 0.21 during the training phase. Correlation features with gain above the threshold are selected. The information gain of the autocorrelation values from the four feature channels is shown in the following figure.

The impact of information gain feature selection is shown in the following figure. The blue dashed line is the AUC score of a classifier that uses all 3000 AutoCox features in each channel. The red solid line shows the AUC score as a function of the information gain threshold; a higher threshold selects fewer features. Accuracy drops once the threshold exceeds 0.6 because too few features remain. The figure shows that selecting a subset of features actually improves AUC over using all features, which supports our claim that not all correlation structure is useful for classification.

The abnormality detection results are reported in Table 1. The AutoCox improves on the BoW model and achieves the best area under the ROC curve (AUC) among state-of-the-art methods. Our proposed model achieves the same AUC as Wu et al. [4], but with far fewer training frames: [4] used 75% of the normal clips from six sequences, whereas we used only 8.26% of the clips, normal and abnormal, from all eleven sequences.

The AUC comparison among the AutoCox, BoW, and BoWSPM models as a function of the training set size is reported below. The AutoCox achieves higher AUC than the BoW or BoWSPM models across all training set sizes. This result reveals that the structure of the point process distribution of each visual word carries more discriminative information than the histogram of occurrence frequencies used by the BoW model. The BoWSPM model also incorporates spatio-temporal context at a coarse level and outperforms the BoW model; however, the partition grid suggested by [5] is too coarse to capture the precise spatio-temporal context in videos.

The following video illustrates examples of the point process and the autocorrelation on two videos.

  • Human action classification

The experiment is performed on the YouTube dataset [6]. The dataset consists of 1168 videos of 11 actions: basketball, biking, diving, golf swing, horse riding, soccer juggling, swing, tennis swing, trampoline jumping, volleyball spiking, and walking. Following [6], 25-fold cross validation is used for classification. In Table 2, the AutoCox model with \(M=10\) achieves the best result. ST-Correlaton degrades the classification accuracy of BoW when its histogram is combined with that of BoW. The AutoCox model with \(M=2000\) visual words also deteriorates the accuracy of BoW. This result suggests that meaningful correlation structure can be extracted using a number of visual words significantly smaller than that of traditional BoW histograms.

4. Conclusion
We present a novel method that enables learning the spatio-temporal context in videos without suffering a quadratic increase in the number of features. The proposed AutoCox model generates contextual spatio-temporal autocorrelation features, one per visual word, to describe longer-range co-occurrence patterns in space and time. Information gain is then applied to extract the meaningful features, which are subsequently used to classify visual events and activities. Our proposed model outperforms BoW and achieves state-of-the-art performance on anomalous crowd activity detection and traffic density classification problems.


  • [1] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox Processes. Scandinavian Journal of Statistics, 1998.
  • [2] H. Wang, A. Kläser, C. Schmid, and L. Cheng-Lin. Action Recognition by Dense Trajectories. CVPR, 2011.
  • [3] Unusual crowd activity dataset:
  • [4] S. Wu, B. E. Moore, and M. Shah. Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes. CVPR, 2010.
  • [5] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. CVPR, 2008.
  • [6] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". CVPR, 2009.

Full text: [pdf]

Presentation slide: [pptx]
