**1. Overview**

Traditional visual classification approaches focus on predicting the absence or presence of labels or attributes for images. However, it is sometimes useful to predict the ratings of labels or attributes endowed with an ordinal scale (e.g., “very important,” “important” or “not important”). In this work, we propose a new method where each label/attribute can be assigned one of a finite set of ordered ratings, from most to least relevant. Object classes are then predicted using these ratings. The ordinal scale representation allows us to describe object classes more precisely than simple binary tagging. Experiments on the Animals with Attributes dataset demonstrate the performance of the proposed method and show its advantages over previous methods based on binary tagging and multi-class classification.

**2. Introduction**

Attributes, such as color, shape, or lightness, are useful for succinct and intuitive characterization of objects. Ratings of labels can be particularly useful when dealing with object attributes. For instance, as illustrated in Figure 1, the attribute “spot” is irrelevant to describing a polar bear. On the other hand, “coastal” is highly relevant because the coast is a habitat of the polar bear. Similarly, “blue” and “swim” can be deemed less relevant and relevant attributes, respectively, for representing the polar bear. Traditional binary tagging (presence/absence) of attributes makes the two classes in Figure 1 indistinguishable, whereas rating each attribute on a four-level scale can disambiguate them, because each object class receives a different rating profile.

In this work, we propose a method that formulates the multi-class classification problem as one of assigning ordinal relevance ratings to attributes, in contrast to other works that consider only binary tagging. We use probabilistic ordinal regression to infer the ordinal ratings of the attributes. Experiments on recognizing animal classes in the *Animals with Attributes* (AwA) dataset [1] show the potential benefits of the proposed method.

Figure 1: Binary tagging vs. ordinal ratings for attributes. Two classes are indistinguishable under traditional binary tagging. On the other hand, rating each attribute on a four-level scale makes them distinguishable from each other.

**3. Direct Attribute Prediction (DAP)**

In this section we briefly present the direct attribute prediction (DAP) method proposed in [1], generalized to \(R\) attribute ratings. The probability of a class \(z\) for a given input \(\mathbf{x}\) is defined as:

\[ P(z|\mathbf{x}) = P(z) \prod_{m=1}^M \frac{P(a_m^z|\mathbf{x})}{P(a_m^z)}, \qquad \qquad (1)\]

where \(P(z)\) is the class prior, \(P(a_m^z)\) is the attribute prior, \(a_m^z\) is the rating of the \(m\)th attribute for class \(z\), and \(P(a_m^z|\mathbf{x})\) is the image-attribute probability that we learn during training. Please refer to [1] for details.
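Equation (1) can be illustrated with a minimal Python sketch. The function name `dap_class_posterior`, the array layouts, the uniform default class prior, and the way the attribute prior is estimated from the attribute table are all assumptions made for this illustration, not details taken from [1]. The sketch also normalizes the scores over the candidate classes so that they sum to one.

```python
import numpy as np

def dap_class_posterior(attr_probs, class_ratings, class_prior=None):
    """Score classes with Equation (1) for R ordinal ratings.

    attr_probs:    (M, R) array, attr_probs[m, r] = P(a_m = r | x)
    class_ratings: (Z, M) int array, class_ratings[z, m] = a_m^z in 0..R-1
    class_prior:   (Z,) array P(z); uniform if None (assumption of this sketch)
    """
    attr_probs = np.asarray(attr_probs, dtype=float)
    class_ratings = np.asarray(class_ratings, dtype=int)
    Z, M = class_ratings.shape
    R = attr_probs.shape[1]
    if class_prior is None:
        class_prior = np.full(Z, 1.0 / Z)
    class_prior = np.asarray(class_prior, dtype=float)

    m_idx = np.arange(M)
    # Attribute prior P(a_m = r): prior-weighted frequency of rating r for
    # attribute m in the attribute table (an assumption of this sketch).
    attr_prior = np.zeros((M, R))
    for z in range(Z):
        attr_prior[m_idx, class_ratings[z]] += class_prior[z]

    eps = 1e-12  # guards against log(0)
    # Entry [z, m] holds log P(a_m^z | x) and log P(a_m^z), respectively.
    log_lik = np.log(attr_probs[m_idx, class_ratings] + eps)
    log_pri = np.log(attr_prior[m_idx, class_ratings] + eps)

    # Work in log space: log P(z) + sum_m [log P(a_m^z|x) - log P(a_m^z)].
    log_scores = np.log(class_prior + eps) + (log_lik - log_pri).sum(axis=1)
    scores = np.exp(log_scores - log_scores.max())
    return scores / scores.sum()

# Toy example: 2 classes, 2 attributes, 2 ratings.
toy_table = [[0, 1], [1, 0]]
toy_probs = [[0.9, 0.1], [0.2, 0.8]]
toy_posterior = dap_class_posterior(toy_probs, toy_table)
predicted_class = int(np.argmax(toy_posterior))
```

Summing log-ratios instead of multiplying probabilities avoids numerical underflow when the number of attributes \(M\) is large.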

**4. Probabilistic Model for Ordinal Regression**

**4.1. Ordinal Regression Model**

In an ideal, noise-free setting, ordinal regression can be interpreted probabilistically as follows:

\[ P_{ideal} (y=c|\mathbf{x}) = \left\{\begin{array}{ll} 1 & \textrm{if } g(\mathbf{x}) \in (b_{c-1}, b_c]\\ 0 & \textrm{otherwise}\end{array} \right. ,\]

where \(g(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})\), \(\phi(\mathbf{x})\) is the feature function which projects the image \(\mathbf{x}\) into the feature space, and the thresholds satisfy \(-\infty = b_0 \leq b_1 \leq b_2 \leq \cdots \leq b_R = \infty\). When we add Gaussian noise \(\delta \sim N(\delta; 0, \sigma^2)\) to the score, the posterior probability becomes

\[
\begin{array}{ll}
P(y=c|\mathbf{x}) &= \int_{\delta} P_{ideal}(y=c|g(\mathbf{x})+\delta) \cdot N(\delta;0,\sigma^2)\,d\delta \\
& = \Phi\left(\frac{b_c - g(\mathbf{x})}{\sigma}\right) - \Phi\left(\frac{b_{c-1} - g(\mathbf{x})}{\sigma}\right),
\end{array}
\]

where \(\Phi(\cdot)\) is the standard normal cumulative distribution function (CDF). We call this method normal cumulative density function scaling (NCDFS).
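The posterior above can be evaluated directly as a difference of two normal CDF values. The following is a minimal sketch; the function names `std_normal_cdf` and `ncdfs_probs` and the example thresholds are hypothetical choices for illustration only.

```python
import math
import numpy as np

def std_normal_cdf(t):
    """Standard normal CDF via the error function; accepts +/- infinity."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ncdfs_probs(g, thresholds, sigma=1.0):
    """Return [P(y=1|x), ..., P(y=R|x)] for a scalar score g = g(x).

    thresholds: finite interior cut points b_1 <= ... <= b_{R-1};
    b_0 = -inf and b_R = +inf are added here.
    """
    b = [-math.inf, *thresholds, math.inf]
    cdf = [std_normal_cdf((b_c - g) / sigma) for b_c in b]
    # Consecutive differences give Phi((b_c - g)/sigma) - Phi((b_{c-1} - g)/sigma).
    return np.diff(cdf)

# Four ratings need three interior thresholds (values chosen arbitrarily).
example_thresholds = [-1.0, 0.0, 1.0]
probs_center = ncdfs_probs(0.0, example_thresholds)
probs_high = ncdfs_probs(2.5, example_thresholds)
```

Because \(b_0 = -\infty\) and \(b_R = \infty\), the \(R\) probabilities always sum to one, and a larger score \(g(\mathbf{x})\) shifts probability mass toward higher ratings.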

**4.2. Normal Cumulative Density Function Scaling (NCDFS)**

To find the optimal parameters of NCDFS for our purpose, we minimize the following negative log-likelihood function:

\[
\begin{array}{ll}
\mathcal{L} &= -\sum_{i=1}^N \sum_{m=1}^M \log P(a_m^{y_i}|\mathbf{x}_i) = -\sum_{i=1}^N \sum_{m=1}^M\sum_{c=1}^{R} I(a_m^{y_i} = c) \\
& \quad \cdot \log \left( \Phi\left(\frac{b_c - g_m(\mathbf{x}_i)}{\sigma}\right) - \Phi \left(\frac{b_{c-1} - g_m(\mathbf{x}_i)}{\sigma}\right)\right),
\end{array}
\]

where \(N\) is the number of images and \(y_i\) is the class label of the input \(\mathbf{x}_i\).

**5. Data Set**

We examine the performance of the proposed method on the Animals with Attributes (AwA) dataset [1]. AwA contains 30,475 images of 50 animal classes, described by 85 attributes. We split the images into training and test data as described in [1]. The test data contains 10 classes: ‘chimpanzee’, ‘giant panda’, ‘hippopotamus’, ‘humpback whale’, ‘leopard’, ‘pig’, ‘raccoon’, ‘rat’, ‘persian cat’ and ‘seal’, consisting of 6,180 images. The training data contains 24,295 images of the other 40 classes.

We use the 6 visual features provided by [1]: RGB color histograms, SIFT, rgSIFT, PHOG, SURF and local self-similarity histograms. An image is represented by concatenating the 6 visual features into a single 10,940-dimensional vector.

We map the attributes to four ratings, “irrelevant” < “less relevant” < “relevant” < “highly relevant.” These four ratings were chosen because they show the best performance. We also investigate how the number of rating values affects prediction accuracy (see Figure 3).

**6. Results**

**6.1. NCDFS Settings**

For the feature function, we set \(\phi(\mathbf{x}) = [1, f(\mathbf{x})]^\top \), where \(f(\mathbf{x})\) is the output of a Support Vector Ordinal Regressor (SVOR) for \(\mathbf{x}\). Building on the SVOR output offers a computational advantage: it is faster than using the raw visual features directly.
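As a short worked step, write the two components of \(\mathbf{w}\) as \(w_0\) and \(w_1\) (names introduced here for illustration only). With this feature function the score reduces to an affine function of the SVOR output:

\[ g(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) = w_0 + w_1 f(\mathbf{x}), \]

so the NCDFS posterior of Section 4.1 becomes

\[ P(y=c|\mathbf{x}) = \Phi\left(\frac{b_c - w_0 - w_1 f(\mathbf{x})}{\sigma}\right) - \Phi\left(\frac{b_{c-1} - w_0 - w_1 f(\mathbf{x})}{\sigma}\right). \]

Under this reading, NCDFS only has to fit a handful of scalar parameters (\(w_0\), \(w_1\), \(\sigma\) and the thresholds) on top of the one-dimensional SVOR output, rather than a weight for each raw feature dimension.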

We first predict the probability \(P_m(r|\mathbf{x})\) of each rating \(r\) for the *m*th attribute given the input image \(\mathbf{x}\) using NCDFS. We then predict the class of \(\mathbf{x}\) as the one that maximizes Equation (1), using the attribute-to-class mapping (attribute table).

To set the baseline performance, we use a multi-class SVM (mSVM). mSVM takes the same input as SVOR. However, mSVM treats all levels (ratings) as unordered categories, whereas SVOR exploits their ordering.

**6.2. Results**

Table 1: Average accuracy for detecting animal classes for [1], mSVM and NCDFS. The multi-class accuracy is measured by the mean of the diagonal of the confusion matrix. One can find that NCDFS outperforms the other methods.
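The accuracy measure used in Table 1, the mean of the diagonal of the confusion matrix, can be sketched as below. The sketch assumes the confusion matrix is row-normalized (each row sums to one over the predictions for one true class); the function name `mean_per_class_accuracy` and the toy labels are hypothetical.

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred, num_classes):
    """Mean of the diagonal of the row-normalized confusion matrix."""
    conf = np.zeros((num_classes, num_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        conf[t, p] += 1.0  # rows: true class, columns: predicted class
    row_sums = conf.sum(axis=1, keepdims=True)
    # Normalize each row; classes without samples keep an all-zero row.
    norm = np.divide(conf, row_sums, out=np.zeros_like(conf), where=row_sums > 0)
    return float(np.mean(np.diag(norm)))

# Toy example with an imbalanced test set: three images of class 0, one of class 1.
toy_true = [0, 0, 0, 1]
toy_pred = [0, 0, 1, 1]
toy_accuracy = mean_per_class_accuracy(toy_true, toy_pred, num_classes=2)
```

Unlike plain accuracy (0.75 in the toy example), this measure weights every class equally regardless of how many test images it has.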

Table 2: Area Under Curve (AUC) of 10 test classes for [1], mSVM and NCDFS (%).

Figure 2: Confusion matrices between 10 test classes of the AwA dataset for [1], mSVM and NCDFS.

Figure 3: Classification accuracies for various numbers of ratings. Figure 3 indicates that the proposed four-rating scale is the best choice for the AwA classification problem based on ordinal regression.

**Reference**

- [1] C. H. Lampert, H. Nickisch, and S. Harmeling. “Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer.” In CVPR, 2009.

**Publication**

- Attribute Rating for Classification of Visual Objects, Jongpil Kim and Vladimir Pavlovic, International Conference on Pattern Recognition (ICPR), *2012*.