1. Overview
Most traditional image annotation approaches focus on tagging images with labels: textual labels from a lexicon of words are either assigned or not assigned to an image based on its visual content or the content of similar images. However, in many situations it may be advantageous to predict the relevance of individual labels and not just their presence or absence. Instead of inferring this relevance as a by-product of tagging, we propose an approach that directly focuses on label relevance prediction. In the proposed setting, each label is assigned to one of a finite set of ordered relevance levels, akin to the classical elicitation of relevance in user studies. To induce dependencies between relevance predictions on related labels, we extend the recently proposed Conditional Ordinal Random Field model to this image relevance assignment task. Experiments on LabelMe and PASCAL VOC 2007 demonstrate the utility of the proposed model and the potential advantages of the new relevance setting over traditional tagging.
2. Introduction
Figure 1: Prediction of relevance for four image labels. The second column shows the output of a traditional binary classification method. The third column shows the relevance level our method predicts for each label with respect to the query image, and the probabilities in the fourth column are the corresponding relevance estimates produced by our method; the level with the highest estimated relevance is selected as the final prediction. Red bars indicate ground-truth relevance.
In this work we propose to formulate the image annotation problem as that of assigning relevance levels to possible image labels. This is illustrated in Figure 1, where the labels are assigned to one of four relevance levels: “not relevant”, “weakly relevant”, “relevant” and “highly relevant.” Unlike in the classification setting, the relevance levels here are ordered: “not relevant” < “weakly relevant” < “relevant” < “highly relevant.” In other words, “not relevant” is closer to “weakly relevant” than it is to “highly relevant.” Assigning discrete levels, in contrast to continuous relevance scores, has several potential advantages. At training (human annotation) time, assigning discrete levels is feasible for human annotators and nearly as easy as traditional absent/present labeling. At prediction (automated annotation) time, the use of relevance levels is appealing because it yields a small set of discrete, easy-to-understand rank categories, precluding the need to form such categories from continuous scores after the prediction stage.
We also propose a computational framework for learning and prediction of image label relevance based on the recently introduced Conditional Ordinal Random Field (CORF) model [1]. CORF directly generalizes traditional computational tools for image annotation, such as Conditional Random Fields (CRFs), to the relevance prediction setting, while taking advantage of correlations between predictions of different labels based on either language or image context. CORF has shown excellent results in the context of dynamic ordinal regression and prediction of emotion expression levels in facial video sequences. We show that CORF can be adapted to the problem of label relevance prediction in image annotation. Our results on images from two datasets, LabelMe and PASCAL VOC 2007, demonstrate the potential benefits of this approach.
3. Relevance Level Conditional Ordinal Regression Field (RL-CORF)
In this section we show how independent-label ordinal regression models can be extended to structured-label settings by merging the ordinal regression model with the traditional conditional random field (CRF) model. We describe how this is accomplished specifically in the context of image relevance annotation.
The typical node potential function of a CRF is replaced, at each node \(r \in V\), with an ordinal regression potential:
\[
\mathbf{v}^{\top} \Psi_r^{(V)}(\mathbf{x}, y_r) \;\rightarrow\; \Gamma_r^{(V)}(\mathbf{x}, y_r; \mathbf{w}, \mathbf{b}, \sigma).
\]
Then we have
\[
\Gamma_r^{(V)}(\mathbf{x}, y_r; \mathbf{w}, \mathbf{b}, \sigma) = \sum_{c=1}^{R} I(y_r = c)\,\log \Big( \Phi\Big(\frac{b_c - f}{\sigma}\Big) - \Phi\Big(\frac{b_{c-1} - f}{\sigma}\Big)\Big),
\]
where \(f = \mathbf{w}^{\top} \phi(\mathbf{x})\), \(\phi(\mathbf{x})\) is a node feature function that maps the input \(\mathbf{x}\) to a feature vector, \(\Phi(\cdot)\) is the standard normal cumulative distribution function, \(I(\cdot)\) is the indicator function, and \(b_0 < b_1 < \dots < b_R\) (with \(b_0 = -\infty\) and \(b_R = +\infty\)) are ordered thresholds that partition the real line into the \(R\) relevance levels.
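For concreteness, the node potential above can be evaluated as in the following minimal NumPy/SciPy sketch; the function and variable names are our own illustrative choices (not from the paper), and the number of levels \(R\) is implied by the number of thresholds.

```python
import numpy as np
from scipy.stats import norm

def node_potential(phi_x, w, b, sigma):
    """Ordinal (probit) node potential Gamma_r^{(V)}(x, c) for c = 1..R (sketch).

    phi_x : node feature vector phi(x)
    w     : weight vector, same dimension as phi_x
    b     : interior thresholds b_1 < ... < b_{R-1}
    sigma : noise scale
    Returns an array of length R with the log terms of the potential.
    """
    f = w @ phi_x                                      # f = w^T phi(x)
    bounds = np.concatenate(([-np.inf], b, [np.inf]))  # b_0 = -inf, b_R = +inf
    upper = norm.cdf((bounds[1:] - f) / sigma)         # Phi((b_c - f) / sigma)
    lower = norm.cdf((bounds[:-1] - f) / sigma)        # Phi((b_{c-1} - f) / sigma)
    return np.log(np.maximum(upper - lower, 1e-12))    # clip to avoid log(0)
```

With three interior thresholds this yields four log-potentials, one per relevance level; the probabilities inside the log sum to one, so each node term is a valid ordinal-probit log-likelihood.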
Finally, we have the RL-CORF model:
\[
p(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big( \sum_{r \in V} \Gamma_r^{(V)}(\mathbf{x}, y_r; \mathbf{w}, \mathbf{b}, \sigma) + \sum_{(r,s) \in E} \mathbf{u}^{\top}\Psi^{(E)}(\mathbf{x}, y_r, y_s) \Big).
\]
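To make the structure of this objective explicit, the unnormalized log-score of a full label configuration could be computed as sketched below. This is an illustrative sketch with hypothetical data structures: it reuses `node_potential` from above and leaves out the edge feature design, the graph construction, and the partition function.

```python
def joint_log_score(y, node_feats, edges, edge_feat_fn, w, b, sigma, u):
    """Unnormalized log p(y | x): ordinal node terms plus linear edge terms (sketch).

    y            : list of relevance levels in 1..R, one per label node in V
    node_feats   : list of node feature vectors phi(x), one per node
    edges        : list of (r, s) index pairs defining the edge set E
    edge_feat_fn : callable (r, s, y_r, y_s) -> edge feature vector Psi^(E)
    u            : edge weight vector
    """
    score = 0.0
    for r, phi_x in enumerate(node_feats):
        score += node_potential(phi_x, w, b, sigma)[y[r] - 1]  # Gamma_r^{(V)} term
    for (r, s) in edges:
        score += u @ edge_feat_fn(r, s, y[r], y[s])            # u^T Psi^{(E)} term
    return score
```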
4. Experimental Results
We evaluate our model in two sets of experiments. First, we consider binary classification to verify the effect of RL-CORF on the LabelMe data; in this experiment, we contrast the performance of our model with that of a Support Vector Machine (SVM). Next, we empirically demonstrate the performance of RL-CORF in predicting relevance levels, using Support Vector Ordinal Regression (SVOR) and a Conditional Random Field (CRF) as the two baselines in this setting.
Table 1: Results of the Binary Classification for LabelMe
| Methods | SVM | RL-CORF |
|---------|-----|---------|
| AUC-PR  | 0.42 | 0.45 |
Table 1 shows the results of the binary classification. As one can see, RL-CORF performs better than SVM: modeling dependencies between labels using a graphical structure leads to improved binary prediction (tagging) performance.
Table 2: Results of Relevance Level Prediction (VUS scores)
| Dataset | SVOR | CRF | RL-CORF |
|---------|------|-----|---------|
| LabelMe | 0.056 | 0.057 | 0.071 |
| PASCAL VOC 2007 | 0.039 | 0.036 | 0.054 |
We use the volume under a 4-dimensional surface (VUS) as the evaluation measure for relevance prediction. The VUS is a generalization of the AUC and its underlying probabilistic interpretation to ordinal regression problems.
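As a rough illustration, the empirical VUS can be estimated as the fraction of tuples containing one example from each relevance level that the predicted scores rank in the correct order. The sketch below assumes real-valued scores and ignores ties, which may differ from the exact definition used in our experiments.

```python
import itertools
import numpy as np

def empirical_vus(scores, levels):
    """Fraction of cross-level tuples whose scores appear in the correct order (sketch).

    scores : predicted relevance scores, one per example
    levels : true ordinal levels (e.g., 1..4), same length as scores
    """
    scores, levels = np.asarray(scores), np.asarray(levels)
    groups = [scores[levels == c] for c in np.sort(np.unique(levels))]
    correct = total = 0
    # Enumerate one example per level and check that scores increase with level.
    for tup in itertools.product(*groups):
        total += 1
        correct += all(a < b for a, b in zip(tup, tup[1:]))
    return correct / total
```

With two levels this reduces to the usual probabilistic reading of the AUC: the probability that a randomly chosen positive example is scored above a randomly chosen negative one.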
Table 2 shows the relevance level prediction results for LabelMe and PASCAL VOC 2007; higher VUS values imply better performance. One can see that RL-CORF outperforms both CRF and SVOR, with at least 24% higher VUS values.
Figures 2, 3, 4 and 5 depict prediction results for arbitrarily chosen images from LabelMe and PASCAL VOC 2007. We show only the labels for which SVOR, CRF, and RL-CORF produce different predictions; the predicted relevance levels for all other labels are identical across the three models. The probabilities in the second, third and fourth columns are the relevance estimates produced by SVOR, CRF and RL-CORF, respectively. Within each column they correspond to the levels “not relevant”, “weakly relevant”, “relevant” and “highly relevant”, from left to right. Red bars indicate ground-truth relevance.
CRF uses the same graphical structure as RL-CORF. However, because the CRF treats the relevance levels as unordered categories and ignores the ordinal scale, it often fails to estimate label relevances accurately. On the other hand, RL-CORF predicts the correct relevance levels in cases where SVOR produces inferior estimates, which can be attributed to the effective sharing of information among labels through the tree structure in RL-CORF.
Figure 2: Prediction of relevance levels for an image in LabelMe. RL-CORF, CRF and SVOR predict the relevance level correctly for 8, 2, and 1 label, respectively.
Figure 3: Prediction of relevance levels for an image in LabelMe. RL-CORF, CRF and SVOR predict the relevance level correctly for 10, 2, and 7 labels, respectively.
Figure 4: Prediction of relevance levels for an image in PASCAL VOC 2007. RL-CORF, CRF and SVOR predict correct relevance for 6, 0, and 4 labels, respectively.
Figure 5: Prediction of relevance levels for an image in PASCAL VOC 2007. RL-CORF, CRF and SVOR predict correct relevance for 10, 1, and 8 labels, respectively.
5. Conclusion
We proposed a new task of assigning image labels to ordinal relevance categories, a natural extension of traditional tagging. Our computational formalism is based on a structured ordinal regression method that effectively estimates relevance levels on an ordinal scale while exploiting possible dependencies among labels, conditioned on the image context. Our experiments show that the proposed model outperforms competing methods, including SVM, SVOR and CRF, in both binary classification and the prediction of label relevance levels.
References
- [1] M. Kim and V. Pavlovic. “Structured output ordinal regression for dynamic facial emotion intensity prediction.” In Computer Vision – ECCV 2010 (K. Daniilidis, P. Maragos, and N. Paragios, eds.), 2010, pp. 649–662.