Sea level prediction using Gaussian Process Models

Sea level prediction is a complicated spatio-temporal regression problem that has drawn considerable attention. Understanding sea level behavior can help us learn more about climate change and its consequences. However, predicting sea level is difficult because we have to deal with problems such as noisy observations and censored data. In this project, we focus on Gaussian processes (GPs) for modelling sea level because of their flexibility and effectiveness.

In the first stage, we are working on predicting sea level from uncertain inputs with ordering constraints. One of the inputs used to predict sea level is the age of each record. However, due to the limitations of the C14 dating technique, we cannot obtain the true ages of the records, only noisy versions of them. In addition, the true inputs must be in decreasing order. Utilizing this information, we propose a fast and accurate method to estimate the true inputs and the hyper-parameters of the GP model.

String Kernel Methods for DNA sequence analysis

The research I have been working on is String Kernel Methods for DNA sequence analysis, under the supervision of Prof. Pavlovic. I focus on the problem of species-level identification based on short DNA fragments known as barcodes. Kernel methods approach classification by mapping the original data into a set of points in a feature space, which potentially makes it easier to detect complex relationships in the data. This, in turn, leads to learning algorithms that can exhibit higher classification accuracy and robustness.
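As a rough illustration of what a string kernel computes, the sketch below implements the spectrum (k-mer counting) kernel, one of the simplest members of the family. It is a stand-in chosen for brevity and is not necessarily the kernel used in this project; the function names and the default `k` are assumptions made for the example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


def normalized_kernel(a, b, k=3):
    """Cosine-normalized kernel value, so that k(a, a) == 1."""
    denom = (spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k)) ** 0.5
    return spectrum_kernel(a, b, k) / denom
```

The resulting kernel matrix over a set of barcodes can be passed to any kernel classifier, for example an SVM with a precomputed kernel.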

Financial Time Series Analysis

Problem Definition

Financial decisions about investing in industry-based indices are often based on heuristics and non-standard methods, or purely on company-specific algorithmic methods. In-depth analysis of historic stock market behavior and of the dynamics among different industries is critical for predicting future trading outcomes. It is also important to identify which companies’ stock prices lead, lag, or follow a trend similar to those of other companies, so that we can identify groups of companies that behave similarly in a given time period and make better investment decisions.

Our Approaches

We are working on methods to enhance the applicability of time series analysis to historic stock market data, in order to identify groupings of companies with similar patterns/behavior and to verify how well these groupings agree with GICS (Global Industry Classification Standard). The GICS sectors are defined as Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunication Services and Utilities. We have worked on a variety of machine learning and probabilistic methods, as described below.
  • Finding similarities between time series sequences using string kernel matching

The time series of historic stock prices are converted into strings over a finite alphabet using a sliding-window approach, and we then use the method proposed by Kuksa et al. [1] to measure the similarity between strings using mismatch kernels. We use only a local mismatch kernel, which measures the similarity between pairs of strings within a specified time lag (for example, within 2 weeks), unlike the global mismatch kernel, which measures the similarity between all possible pairs of strings. The reason is that in the financial domain the impact of one time series on another is short term: the stock market is efficient, so the longer-term impact would be minimal. We then cluster the companies/tickers with the affinity propagation algorithm to find groups with similar patterns and evaluate how well the clusters agree with the GICS classification. One set of results we have obtained is shown in the following table.
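The idea of a lag-limited string comparison can be sketched as follows. This is a simplified illustration and not the scalable algorithm of [1]: it symbolizes returns with a hypothetical three-letter alphabet and a hypothetical threshold, and it counts window pairs directly instead of using the mismatch-neighborhood feature map.

```python
def symbolize(prices, threshold=0.01):
    """Map each period-over-period return to 'U' (up), 'D' (down) or 'F' (flat).

    The alphabet and the threshold are assumptions made for this example.
    """
    symbols = []
    for prev, cur in zip(prices, prices[1:]):
        ret = (cur - prev) / prev
        symbols.append('U' if ret > threshold else 'D' if ret < -threshold else 'F')
    return ''.join(symbols)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def local_mismatch_similarity(s, t, k=3, m=1, max_lag=2):
    """Count pairs of length-k windows, one from s and one from t, whose start
    positions differ by at most max_lag and that differ in at most m symbols."""
    total = 0
    for i in range(len(s) - k + 1):
        lo = max(0, i - max_lag)
        hi = min(len(t) - k, i + max_lag)
        for j in range(lo, hi + 1):
            if hamming(s[i:i + k], t[j:j + k]) <= m:
                total += 1
    return total
```

Setting `max_lag` to the length of the strings recovers a global comparison over all window pairs, which is the contrast drawn in the paragraph above.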

  • Granger Causality based analysis

Granger causality is a statistical technique, introduced by the Nobel Prize winner Clive Granger, for testing whether a given time series has a causal relationship with another time series. We used this hypothesis test to determine how the historic time series of one company/ticker is affected by the lagged time series of other companies/tickers. We performed this analysis for companies/tickers within and between different industry sectors (as defined by the GICS classification) with different thresholds of statistical significance, and analyzed how the resulting causal graphs vary over time. This type of analysis led to the idea of looking at time-varying graphs [2], which identify how the causal relationships between the companies/tickers change over time. The following graph shows the number of Granger-causal links within each industry sector for a specific time period.

  • Sparse Regression based analysis
We have also tried Lasso regression to model the linear relationship between a single ticker/company’s time series and the time series of other tickers/companies. This analysis was important for finding a sparse representation of the relationships between tickers within the same sector and across different sectors. The following bar graph shows the distribution of within-sector and between-sector links after applying Lasso regression with a penalty parameter chosen on a validation set of time sequences.
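To make the sparsity mechanism concrete, here is a minimal coordinate-descent Lasso solver in plain Python. It is a sketch under stated assumptions: the objective scaling \(\frac{1}{2n}\|y - Xw\|^2 + \lambda\|w\|_1\), the absence of an intercept, and the function names are choices made for the example, not details taken from the project.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent.

    X is a list of rows (one row per time step, one column per predictor
    ticker); y is the target ticker's series.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw for the initial w = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_wj
    return w
```

Each nonzero coefficient can then be read as a link from a predictor ticker to the target ticker; a larger `lam` yields fewer links.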

  • Random Graph based analysis
We are also interested in modelling the relationships between different time series using random graphs. We have run experiments with Exponential Random Graph Models (ERGMs), modelling these financial time series using a set of network statistics such as density, number of mutual edges, and number of triangles, which indirectly control the structure of the graphs and how sparse or dense they are. This type of analysis can also be used to model how the graphs and their parameters change over time, resulting in dynamic graph analysis methods that could discover important links between companies and how they change over time.
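The network statistics named above can be computed directly from an edge list. The sketch below is only meant to show what those quantities measure; it assumes a directed graph with nodes numbered from 0 and counts triangles on the undirected skeleton, which is one of several conventions. Fitting an ERGM itself requires a dedicated package and is not shown.

```python
def graph_statistics(n_nodes, edges):
    """Density, mutual-edge count and triangle count of a directed graph.

    edges is an iterable of (source, target) pairs; self-loops are ignored.
    """
    edge_set = {(u, v) for u, v in edges if u != v}
    density = len(edge_set) / (n_nodes * (n_nodes - 1))
    # A mutual (reciprocated) pair is counted once.
    mutual = sum(1 for u, v in edge_set if u < v and (v, u) in edge_set)
    # Triangles are counted on the undirected skeleton of the graph.
    nbrs = {i: set() for i in range(n_nodes)}
    for u, v in edge_set:
        nbrs[u].add(v)
        nbrs[v].add(u)
    triangles = sum(
        1
        for u in nbrs
        for v in nbrs[u] if u < v
        for w in nbrs[u] & nbrs[v] if v < w
    )
    return {"density": density, "mutual": mutual, "triangles": triangles}
```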

[1] Kuksa, Huang & Pavlovic, Scalable Algorithms for String Kernels with Inexact Matching, Neural Information Processing Systems 2008 (NIPS 2008)
[2] Kolar, Ahmed, Xing, Estimating Time Varying Networks, 2010, The Annals of Applied Statistics, Vol. 4, No. 1, 94–123


Conditional Ordinal Random Fields

Conditional Random Fields (CRFs) and Hidden Conditional Random Fields (HCRFs) are a staple of many sequence tagging and classification frameworks. An underlying assumption in those models is that the state sequences (tags), observed or latent, take their values from a set of nominal categories. These nominal categories typically indicate tag classes (e.g., part-of-speech tags) or clusters of similar measurements. However, in some sequence modeling settings it is more reasonable to assume that the tags indicate ordinal categories or ranks. Dynamic envelopes of sequences such as emotions or movements often exhibit intensities growing from neutral, through raising, to peak values.

In this project we develop models and algorithms for sequences of ranks or ordinal categories.  Our first model, CORF (Conditional Ordinal Random Field) [1], is to ordinal data what CRF is to nominal data.  HCORF (Hidden Conditional Ordinal Random Field) [2] generalizes this idea to latent settings, where we cannot observe ordinal ranks but still want to model dynamics in this space.

We have applied these models to analysis of facial emotions and facial emotion intensities, as well as classification of human activities from video sequences. 

Software: code.


  • [1] M. Kim and V. Pavlovic. “Structured output ordinal regression for dynamic facial emotion intensity prediction”. Computer Vision – ECCV 2010. K. Daniilidis, P. Maragos, and N. Paragios, eds. 2010. pp. 649-662.
  • [2] M. Kim and V. Pavlovic. “Hidden Conditional Ordinal Random Fields for Sequence Classification”. ECML/PKDD. 2010. pp. 51-65.

Sparse Granger Causality Graphs for Human Action Classification

1. Overview

Modeling and classification of human actions are important problems that have received significant attention in pattern recognition.  Mocap data is widely available and can serve as a good proxy for assessing action models before they are applied to video data. In this paper, we present a human action classification framework that extends the video analysis using Granger causality graphs to represent densely sampled human actions embodied in mocap data. We accomplish this by defining sparse events detected in movements of human body parts. The events are taken as nodes of a graph and edge weights are calculated from Granger causality between pairs of events. The graph describes human actions in terms of causal relationship among body parts movements.

Fig 1. Framework overview. Walking action is shown on the left column and jumping on the right. Top row depicts example of mocap sequences \(\bf{d_k}\). Two point processes on the events of right leg \(N_{k1}\) and left leg \(N_{k4}\) are shown for each action. Different temporal patterns are observed for different actions. From the point processes, Granger causality graph \(G_k\)  is constructed to represent its motion by causal relations between events. For walking sequence, the event in left leg causes right leg(\(G_{N_{k1}\to N_{k4}}\)). But the same causal relationship is not observed for jumping. Finally, a model that classifies causal graphs is learned for each action class.

2. Prior work

Granger causality is a statistical test for detecting a directed relationship between two time series [1]. When predicting a time series \(X\), another time series \(Y\) is said to cause \(X\) if adding \(Y\) improves the prediction of \(X\). Given two autoregressive (AR) models of \(X\)

\[ X_t=\sum_{j=1}^{\infty}a_{1j}X_{t-j}+\epsilon_{1t},\>\>\epsilon_{1t} \sim \mathcal{N}(0, \Sigma_1), \tag{1} \]

\[ X_t=\sum_{j=1}^{\infty}a_{2j}X_{t-j}+\sum_{j=1}^{\infty}b_{2j}Y_{t-j}+\epsilon_{2t}, \>\>\epsilon_{2t} \sim \mathcal{N}(0, \Sigma_2), \tag{2} \]

the causal power \(G_{Y \to X}\) is high if adding \(Y\) reduces prediction error of \(X\). Thus, Granger causality is defined as: \(G_{Y \to X}=\ln(\Sigma_1/\Sigma_2)\).
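As a worked example of the definition \(G_{Y \to X}=\ln(\Sigma_1/\Sigma_2)\), the sketch below fits both AR models by least squares and compares their residual variances. It makes two simplifying assumptions that are not part of the definition above: the infinite sums are truncated to a single lag, and no intercept is fitted. The simulated pair of series is purely illustrative.

```python
import math
import random


def granger_lag1(target, source):
    """ln(var_restricted / var_full) for lag-1 AR models of `target`.

    Restricted model: target_t = a1 * target_{t-1} + e1
    Full model:       target_t = a2 * target_{t-1} + b2 * source_{t-1} + e2
    """
    xt, xp, yp = target[1:], target[:-1], source[:-1]
    n = len(xt)
    sxx = sum(a * a for a in xp)
    syy = sum(b * b for b in yp)
    sxy = sum(a * b for a, b in zip(xp, yp))
    sxt = sum(a * t for a, t in zip(xp, xt))
    syt = sum(b * t for b, t in zip(yp, xt))
    # Restricted model: one-variable least squares.
    a1 = sxt / sxx
    var1 = sum((t - a1 * a) ** 2 for t, a in zip(xt, xp)) / n
    # Full model: solve the 2x2 normal equations in closed form.
    det = sxx * syy - sxy * sxy
    a2 = (sxt * syy - syt * sxy) / det
    b2 = (syt * sxx - sxt * sxy) / det
    var2 = sum((t - a2 * a - b2 * b) ** 2 for t, a, b in zip(xt, xp, yp)) / n
    return math.log(var1 / var2)


def simulate_pair(n, seed=0):
    """Toy data in which y drives x with a one-step delay."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]
    x = [0.0] + [0.8 * y[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, n)]
    return x, y
```

With `x, y = simulate_pair(500)`, `granger_lag1(x, y)` (the power of \(Y \to X\)) is large while `granger_lag1(y, x)` stays close to zero, reflecting the direction in which the toy data were generated.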

Non-parametric pairwise Granger causality is calculated as follows: given two point processes \(n_X\) and \(n_Y\), a power spectral matrix \(S_{XY}\) is defined as the Fourier transform of covariance of two point processes \(n_{X}, n_{Y},\) which is estimated using the multitaper function \(h_k(t_j)\) [2]:

\[ S_{XY}(f)=\frac{1}{2\pi KT}\sum_{k=1}^{K}\widetilde{n_X}(f,k)\widetilde{n_Y}(f,k)^*,\tag{3} \]
\[ \widetilde{n_i}(f,k)=\sum_{j}h_k(t_j)\exp(-i2 \pi f t_j),\tag{4}\]

and \(S_{XY}\) is factorized by Wilson’s algorithm as follows:

\[ S_{XY}(f)=H_{XY}(f) \Sigma_{XY} H^{*}_{XY}(f),\tag{5}\]

where \(H_{XY}\) is the transfer function, which corresponds to the coefficients of the AR model, \(\Sigma\) corresponds to the covariance matrix of the error term of the AR model, and \(\ast\) denotes the conjugate transpose.
Nonparametric pairwise Granger causality of \(G_{n_Y \to n_X}\) for frequency \(f\) is finally calculated as:

\[ G_{n_{Y} \to n_{X}}(f)=\ln\frac{S_{XX}(f)}{S_{XX}(f)-(\Sigma_{YY}-\frac{\Sigma_{XY}^2}{\Sigma_{XX}}) |H_{XY}(f)|^2},\tag{6} \]

We will use this notion of causality in analyzing the mocap data, where many motions exhibit natural (quasi) periodic behavior.

3. Sparse Granger Causality Graph Model


Algorithm 1. Sparse Granger Causality Graph Model

Input: mocap dataset \( D = \{(\mathbf{d}_1, y_1), \ldots, (\mathbf{d}_n, y_n)\}\)

1. Generate a set of point processes \(N_k\) from \(\mathbf{d}_k\)

\({\bf for } \textrm{joint } i\)

\(N_{ki} \leftarrow \{1(t)|d_{ki}^t = \textrm{peak or valley}\}\)

\({\bf end for}\)

2. Estimate Granger causality graph \(G_k\) from \(N_k\)

\({\bf for}\) joint pair \(X,Y\)

Estimate spectrum \(S_{XY}\) from Eq. (3)

Factorize \(S_{XY}\) from Eq. (5)

Estimate Granger causality \(G_{n_Y \to n_X}\) from Eq. (6)

\({\bf end for}\)

3. Learn a sparse model

\({\bf for}\) action class \(c_i\)

Learn L1 regularized log. reg. model from \(\{G_k|y_k = c_i\}\)

\({\bf end for}\)

Our approach to building the sparse causality graph models of human actions is summarized in Algorithm 1.  The approach, denoted by SGCGM, has three major steps:

1. From each mocap sequence, events are detected and point processes are generated on the events. A mocap sequence consists of multiple time series densely recorded for each joint angle. From each dense time series of a joint angle, two different events of peak and valley are extracted through the extreme point detector. As a result, we define \(M\) events over all joints, and \(M\) point processes on events construct a set of point processes \(N_k\) for a mocap sequence \(\bf{d_k}\). We assume that representing joint angle trajectories with extreme points conveys enough information to construct causal structure among joints.

2. From a set of point processes \(N_k\) for a mocap sequence \(\bf{d_k}\), a Granger causality graph \(G_k\) is constructed. For each pair of point processes \(N_k\), a power spectrum \(S\) is estimated by the multitaper method.
Pairwise non-parametric Granger causality is calculated over \(F\) frequencies from Eq. (6). As a result, a Granger causality graph is represented by \(F\) adjacency matrices of size \(M \times M\), one for each frequency band. Each node represents an event in the point process \(n_X\), and the pairwise causal power \(G_{n_Y \to n_X}\) is used as the directed edge weight between nodes \(X\) and \(Y\).

3. After each mocap sequence is converted into a causal graph, we learn a model that classifies the causal graph into one of the action classes. In order to capture sparse structure of the graph displayed across samples of each class, we exploit an L1 regularized logistic regression model. To represent the graph, we take Granger causality of all edges and frequencies as a feature.
A sparse regression model learns regression coefficients between the input features and the class label. A classification model is learned for each action class from the training data, and each test sequence is assigned to the action whose logistic function gives the highest confidence.
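Step 1 above, the extreme point detector, could be sketched as follows. The strict-inequality rule for peaks and valleys and the naming of the resulting point processes are assumptions made for the example; the original detector may smooth the signal or handle plateaus differently.

```python
def detect_events(series):
    """Return (peaks, valleys): time indices of strict local maxima and minima."""
    peaks, valleys = [], []
    for t in range(1, len(series) - 1):
        prev, cur, nxt = series[t - 1], series[t], series[t + 1]
        if cur > prev and cur > nxt:
            peaks.append(t)
        elif cur < prev and cur < nxt:
            valleys.append(t)
    return peaks, valleys


def to_point_processes(joint_angles):
    """Turn {series_name: joint-angle series} into event-time lists.

    Every input series yields two point processes, one for peaks and one for
    valleys, so 62 joint-angle series give M = 124 processes.
    """
    processes = {}
    for name, series in joint_angles.items():
        peaks, valleys = detect_events(series)
        processes[name + ":peak"] = peaks
        processes[name + ":valley"] = valleys
    return processes
```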

4. Experimental Evaluation

We performed experiments on the HDM05 dataset [3]. HDM05 contains mocap data in form of 29 skeletal joints, each of 2-3 rotation angles, resulting in 62 joint angle time series. From each time series, two events of peak and valley are detected. As a result, the number of total point processes M is set to 124. Also, we set the number of frequencies F in a power spectrum to 128. Upon computing the Granger causality graph for all 128 frequencies, we summarize them into 4 bands of high, mid-high, mid-low and low frequency by applying Hanning window in log scale.


SGCGM depends on the model learned for each class, which requires sufficient number of samples. HDM05 dataset is well-suited for our requirement with more than 100 classes and multiple trials performed by 5 subjects. We select 8 action classes that have a sufficient number of samples across subjects. The chosen classes are listed in Table 2.


We perform 5-fold cross validation in two different settings. In the first setting, the data is randomly sampled across subjects, so that both the test and the training data contain samples of motions performed by the same person. In the second setting, we split the data so that data from the test subject is not used during training. This is typically a more challenging setting.
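The second setting can be illustrated with a small subject-wise splitter. This is a sketch that assumes one fold per distinct subject, which with 5 subjects gives 5 folds; the function name and the label format are hypothetical.

```python
def leave_one_subject_out(subjects):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    subjects[i] is the identifier of the person who performed sample i.
    """
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        yield held_out, train, test
```

In the first setting, by contrast, the sample indices would simply be shuffled and cut into five parts regardless of the subject.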

Events detected from CMU motion capture data

  • Events of left and right knee extracted from a walking sequence

  • Events of left and right knee extracted from a jumping sequence

Granger causal graph

Fig 2. A Granger causality graph of the class DepositFloorR. Edges having top 5\% of the weights are drawn. Edges among femurs, tibias and feet describe bending legs. Direct edges from right hand and thumb to right and left tibia represent deposit motion with right hand.

5. Results

Table 1. Classification Performance on HDM05
Cut1 79.31±8.1 69.5±8.49 62.8±6.8 74.6±9.9 87.4±4.7
Cut2 45.9±13.0 51.8±10.5 50.9±12.0 57.5±11.4 69.3±9.5

Table 2: Confusion matrix of SGCGM result for Cut1


  • [1] C. W. J. Granger. “Investigating causal relations by econometric models and cross-spectral methods”, Econometrica, 37(3):424–438, 1969.
  • [2] A. Nedungadi, G. Rangarajan, N. Jain, and M. Ding. “Analyzing multiple spike trains with nonparametric granger causality”, Journal of Computational Neuroscience, 27:55–64, 2009.
  • [3] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, A. Weber “Documentation Mocap Database HDM05”, Technical report, No. CG-2007-2, ISSN 1610-8892, Universität Bonn, June 2007.


Full text:[pdf]

    Presentation slide:[pptx]

Spatio-Temporal Context Modeling for BoW-Based Video Classification

1. Abstract
We propose an autocorrelation Cox process that extends the traditional bag-of-words representation to model the spatio-temporal context within a video sequence. Bag-of-words models are effective tools for representing a video by a histogram of visual words that describe local appearance and motion. A major limitation of this model is its inability to encode the spatio-temporal structure of visual words pertaining to the context of the video. Several works have proposed to remedy this by learning the pairwise correlations between words. However, pairwise analysis leads to a quadratic increase in the number of features, making the models prone to overfitting and challenging to learn from data. The proposed autocorrelation Cox process model encodes, in a compact way, the contextual information within a video sequence, leading to improved classification performance. Spatio-temporal autocorrelations of visual words estimated from the Cox process are coupled with the information gain feature selection to discern the essential structure for the classification task. Experiments on crowd activity and traffic density dataset illustrate that the proposed model achieves state-of-the-art performance while providing intuitive spatio-temporal descriptors of the video context.

2. Spatio-temporal context model

  • Univariate Cox process

A Cox process \(X\) is a point process defined on a locally finite subset \(S \subset \mathbb{R}^2\) whose intensity \(\Lambda\) is itself a stochastic process.  If the intensity \(\Lambda\) is spatially constant, the Cox process reduces to a homogeneous Poisson process.  [1] proposed the Log Gaussian Cox Process (LGCP) to model spatial point processes.  The intensity process \(\Lambda\) of an LGCP is a log Gaussian process:

\[ \begin{aligned} \Lambda &= \{\Lambda(s) : s \in \mathbb{R}^2\}, \\ \Lambda(s) &= \exp\{ Y(s)\}, \\ Y &\sim \mathcal{N}(\mu,\sigma^2) \end{aligned} \]

Summary statistics of the Log-Gaussian Cox process \(X\) with intensity \(\Lambda\) are defined by the first and second order moments.  [1] suggest efficient non-parametric estimators of the mean intensity \(\rho\) and the pair correlation function \(g(r)\) for a univariate Cox process \(X\):

\[ \begin{aligned} \hat{\rho} &= n/A(S),\\ \hat{g}(r) &= \frac{1}{2\pi r \hat{\rho}^2 A(S)}\sum_i \sum_{j \ne i}k_h(r - \|x_i - x_j \|_2)\,b_{ij},\\ k_h(a) &= \frac{3}{4h}\left(1-a^2/h^2\right) \delta(a), \\ \delta(x) &= \begin{cases} 1 & \text{if } -h \le x\le h\\ 0 & \text{otherwise} \end{cases} \end{aligned} \]

The following figure illustrates Cox correlation function in 1D space. (a)-(c) Kernel weights for corresponding radius. (d)  Cox process correlation function over different radii.
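A direct transcription of the estimator \(\hat{g}(r)\) is short enough to show in full. The sketch below assumes \(b_{ij}=1\) for all pairs, that is, no edge correction, because the weights \(b_{ij}\) are not specified here; the function names are likewise choices made for the example.

```python
import math


def epanechnikov(a, h):
    """Kernel k_h(a) = 3/(4h) * (1 - a^2/h^2) on [-h, h], zero elsewhere."""
    return 0.75 / h * (1.0 - a * a / (h * h)) if -h <= a <= h else 0.0


def pair_correlation(points, r, h, area):
    """Estimate g(r) for a point pattern observed in a window of the given area.

    Assumes b_ij = 1 (no edge correction).
    """
    n = len(points)
    rho = n / area  # estimated mean intensity
    total = 0.0
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j:
                total += epanechnikov(r - math.dist(p, q), h)
    return total / (2.0 * math.pi * r * rho ** 2 * area)
```

Evaluating `pair_correlation` over a grid of radii gives the correlation profile of one point process; values above 1 indicate clustering at that distance and values below 1 indicate repulsion.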

  • Autocorrelation Cox process

Each video \(n \in N\) is represented by \(K\) visual word point processes.  Point \({\bf x}\) consists of \((l,v,h,t)\) : the visual word label \(l\), vertical location \(v\), horizontal location \(h\), and the frame number \(t\).  For each point process \(X_k\), the univariate Cox process \(\hat{g}_k\) is estimated as described in Univariate Cox process. \(\hat{g}_k\) represents the spatio-temporal distribution of \(X_k\).  Our goal is to learn a common structure of each visual word within a video class.  

Correlation estimates of the univariate Cox process for all visual words are taken as the input feature of each video.  In order to infer the autocorrelation structure relevant for classification, we adopt the information gain feature selection principle.  From an initial pool of features D with C classes, information gain of each feature \({\bf d_f}\) is calculated.

\[ \begin{aligned} IG(D, {\bf d}_f) &= H(D) - \sum_{i = 1 }^{V} \frac{|{\bf d}_f=i|}{|D|} H({\bf d}_f=i), \\ H(D) &= -\sum_{j=1}^{C}p_j(D) \log p_j(D) \end{aligned} \]

Features with information gain higher than the threshold are selected as input features to an SVM classifier. Algorithm 1 summarizes the AutoCox modeling approach for video classification.
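The information gain criterion can be computed in a few lines. The sketch below assumes each feature has already been discretized into a finite set of values (the \(V\) values in the formula) and uses base-2 logarithms; the base is not stated above, so any threshold should be interpreted relative to it.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(D) minus the weighted entropy of the labels within each feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(columns, labels, threshold):
    """Indices of feature columns whose information gain exceeds the threshold."""
    return [f for f, col in enumerate(columns)
            if information_gain(col, labels) > threshold]
```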

The following figure shows examples of the AutoCox using isotropic and anisotropic kernels. (a),(b) Point process from top-view. (c),(d) correlation for each kernel.

3. Experiments

For all datasets, local motion and visual appearance features are extracted using the Dense Trajectories video descriptor [2].  Once the descriptors are extracted, K-means clustering is applied on each channel to construct the visual word sets, and each descriptor is assigned to its closest center.  Each descriptor in a video has four independent labels, one for each of the four feature channels.  In the next phase, the autocorrelation of each visual word is estimated using the Cox process.  The correlation function can be estimated for an isotropic kernel in 3D or an anisotropic kernel with independent space and time profiles.  Radii settings and the width of the kernel are decided empirically.  Finally, the discriminative autocorrelation elements are selected using information gain feature selection, and an SVM model is learned for classification.

  • Abnormal group activity detection

The UMN dataset [3] consists of group activities in two different states, normal and abnormal.  The video is recorded in three different scenes, one indoor and two outdoor, with a total of eleven video sequences that start in the normal state of natural walking and end in the abnormal “disperse” motion of the group of people.

Information gain threshold is set as 0.21 during the training phase.  Correlation features with the gain higher than the threshold are selected.  Information gain of autocorrelation values from four different feature channel is shown in the following figure.

The impact of the information gain feature selection is shown in the following figure.  The blue dashed line is the AUC score of the classifier that uses all 3000 features of the AutoCox in each channel.  The red solid line shows the AUC scores as a function of the information gain threshold.  A higher information gain threshold selects fewer features.  Accuracy drops once the gain threshold exceeds 0.6 because too few features are selected.  The figure also shows that selecting a subset of features improves AUC over using all features.  This supports our claim that not all correlation structure is useful for classification.

The abnormality detection results are reported in Table 1.  The AutoCox improves on the result of the BoW model and achieves the best area under the ROC curve (AUC) among state-of-the-art methods.  Our proposed model achieved the same AUC as Wu et al. [4].  However, our model accomplished this with fewer frames in the training set.  [4] used 75% of normal clips from six sequences, whereas we used only 8.26%, including normal and abnormal clips from all eleven sequences.

The AUC comparison among the AutoCox, the BoW, and the BoWSPM models as a function of the training set size is reported in the following.  The AutoCox achieves higher AUC than the BoW or BoWSPM models across different sizes of the training set.  This result reveals that the structure of the point process distribution of each visual word conveys more discriminative information than the histogram of occurrence frequencies that the BoW model uses.  The BoWSPM model also incorporates spatio-temporal context at a coarse level and thereby outperforms the BoW model.  However, the partition grid of BoWSPM suggested by [5] is too coarse to capture the precise spatio-temporal context in videos.

The following video illustrates examples of the point process and the autocorrelation on two videos.

  • Human action classification

The experiment is performed on the YouTube dataset [6].  The dataset consists of 1168 videos from 11 actions: basketball, biking, diving, golf swing, horse riding, soccer juggling, swing, tennis swing, trampoline jumping, volleyball spiking and walking.  Following [6], 25-fold cross validation was used for classification.  In Table 2, the AutoCox model with \(M=10\) achieved the best result.  ST-Correlaton degrades the classification accuracy of BoW when its histogram is combined with that of BoW.  The AutoCox model with \(M=2000\) visual words also deteriorates the accuracy of BoW.  This result suggests that meaningful correlation structure can be extracted using a number of visual words significantly smaller than that of traditional BoW histograms.

4. Conclusion
We present a novel method that enables learning the spatio-temporal context in videos without suffering a quadratic increase in the number of features.  The proposed AutoCox model is used to generate contextual autocorrelation spatio-temporal features, one per visual word, to describe longer range co-occurrence patterns in space and time.  Information gain is then applied to extract meaningful features that are subsequently used to classify visual events and activities.  Our proposed model outperforms the BoW and achieves state-of-the-art performance on anomalous crowd activity detection and traffic density classification problems.


  • [1] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox Processes. Scandinavian Journal of Statistics, 1998.
  • [2] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. CVPR, 2011.
  • [3] Unusual crowd activity dataset:
  • [4] S. Wu, B. E. Moore, and M. Shah. Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes. CVPR, 2010.
  • [5] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. CVPR, 2008.
  • [6] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos “in the wild”. CVPR, 2009.

Full text: [pdf]

Presentation slide: [pptx]

Pose Invariant Activity Classification for Multi-floor Indoor Localization

1. Abstract

Smartphone based indoor localization caught massive interest of the localization community in recent years.  Combining pedestrian dead reckoning obtained using the phone’s inertial sensors with the GraphSLAM (Simultaneous Localization and Mapping) algorithm is one of the most effective approaches to reconstruct the entire pedestrian trajectory given a set of visited landmarks during movement.  A key to GraphSLAM-based localization is the detection of reliable landmarks, which are typically identified using visual cues or via NFC tags or QR codes.  Alternatively, human activity can be classified to detect organic landmarks such as visits to stairs and elevators while in movement.  We provide a novel human activity classification framework that is invariant to the pose of the smartphone.  Pose invariant features allow robust observation no matter how a user puts the phone in the pocket.  In addition, activity classification obtained by an SVM (Support Vector Machine) is used in a Bayesian framework with an HMM (Hidden Markov Model) that improves the activity inference based on temporal smoothness.  Furthermore, the HMM jointly infers activity and floor information, thus providing multi-floor indoor localization.  Our experiments show that the proposed framework detects landmarks accurately and enables multi-floor indoor localization from the pocket using GraphSLAM.

2. Motivation

  • We extended the design of pose invariant features for an activity classification task.  Pose is defined by how a person puts a smartphone in the pocket.  We show that pose invariant features can be used to successfully classify activities.
  • We designed a Hidden Markov Model that enables the integration of activity classification and floor inference.
  • We applied the GraphSLAM algorithm with our activity and floor detection framework to provide multi-floor localization in a building.

3. Framework

The overview of the framework is depicted in the following figure.

a) Feature Extraction

  • Pose Invariant Feature for IMU Sensors

IMU sensor readings depend on the pose of the smartphone, which is defined as the orientation of the phone in the pocket.  A pose-invariant system is strongly desirable because it frees the user from the restriction of keeping the smartphone in a particular orientation. [1] identified that the autocorrelation of acceleration data is invariant to the rotation of the accelerometer: 

\[ \begin{aligned} f(\omega) &= \int \exp(-i \omega t)\,s(t)\,dt \in \mathbb{C}^3, \\ F &= [f(\omega_1), f(\omega_2), \dots, f(\omega_n)] \in \mathbb{C}^{3\times n},\\ A &= F^*F. \end{aligned} \]

The pose invariant property is inherited from the fact that the rotation matrix \(R\) is an orthogonal matrix, \(R^TR=I\).
\[ \begin{aligned} \hat{s}(t) &= Rs(t), \\ \hat{f}(\omega) &= \int \exp(-i \omega t)\,\hat{s}(t)\,dt \\ &= R\int \exp(-i \omega t)\,s(t)\,dt \\ &= Rf(\omega), \\ \hat{A} &= \hat{F}^*\hat{F} = F^*R^TRF = F^*F = A. \end{aligned} \]

We extend the idea and apply the pose invariant features on both accelerometer and gyroscope sensor data to classify pedestrian activity.  The following figure illustrates pose invariant features.
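The invariance \(\hat{A}=A\) can be checked numerically. The sketch below replaces the integral with a discrete sum over unit-spaced samples and uses a rotation about the z-axis as the test rotation; both are assumptions made to keep the example short, and any orthogonal \(R\) would behave the same way.

```python
import cmath
import math


def spectrum(signal, freqs):
    """F[a][k]: Fourier sum of axis a of a 3-axis signal at frequency freqs[k]."""
    return [
        [sum(cmath.exp(-1j * w * t) * sample[a] for t, sample in enumerate(signal))
         for w in freqs]
        for a in range(3)
    ]


def autocorr_feature(signal, freqs):
    """A = F^* F, an n-by-n matrix that does not depend on sensor orientation."""
    F = spectrum(signal, freqs)
    n = len(freqs)
    return [
        [sum(F[a][j].conjugate() * F[a][k] for a in range(3)) for k in range(n)]
        for j in range(n)
    ]


def rotate_z(signal, theta):
    """Apply a rotation about the z-axis to every (x, y, z) sample."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in signal]
```

Computing `autocorr_feature` on a signal and on its rotated copy yields the same matrix up to floating-point error, which is what allows the phone to sit in the pocket in any orientation.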


  • Statistical Features from a Barometer

Barometer readings consistently fluctuate even if the sensor stays at the same level, thus we need to use some statistical features to get robust observations, as listed in the following table.

b) SVM Activity Classification
Rotation invariant features are extracted from accelerometer and gyroscope sensors and statistical features from barometer in each sliding window of an input sequence.  A linear SVM model classifies each sliding window sample and generates class probability from Platt’s scaling algorithm. SVM classification is limited to observations from one sliding window and has no ability to maintain reference to activities occurring in previous sliding windows.  Hence, there may arise sporadic misclassifications.  Classification results can be improved if we promote temporal smoothness on the activity sequence.

c) HMM Activity and Floor Inference
Activity classification results obtained from the SVM can be refined by an HMM if we define activities as states and suppress the unlikely state transitions.  Furthermore, by extending the definition of a state as a joint identification of the activity and the floor, state inference can integrate activity with floor inferences.  Such a combined state will help constrain the state transition.

  • Transition probability

We manually design the transition probabilities as shown in the following figure.  It results from the fact that activity transition occurs sparsely over time, thus probability of state transition is much lower than staying in the same state.  Moreover, the transition between certain activities is not possible.

  • Observation probability

Observation probabilities are obtained jointly from activity and floor likelihood.  Air pressure observation from a barometer \(y_{\textrm{floor}}\) is modeled by a mixture of Gaussians, where each floor forms a Gaussian distribution with \((\mu_{\textrm{floor}}, \sigma_{\textrm{floor}})\).  Activity class posterior \(p(s_{\textrm{act}_i}|y_\textrm{act})\) is estimated from Platt’s scaling on SVM decision values.

\[ \begin{align*} p(y|s_i) &= \frac{p(s_i | y)p(y)}{p(s_i)},\\ p(s_i|y) &= p(s_{\textrm{floor}_i}|y_{\textrm{floor}})\, p(s_{\textrm{act}_i}|y_{\textrm{act}}),\\ p(y) &= \frac{1}{|T|}, \quad p(s_i) = \frac{1}{|S|}. \end{align*} \]
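The joint observation score can be sketched as below. Since \(p(y)\) and \(p(s_i)\) are constants, \(p(y|s_i)\) is proportional to \(p(s_i|y)\), so the sketch computes only that product. The floor means and standard deviations (in hPa), the equal mixture weights and the activity posterior are hypothetical values for illustration.

```python
import math

def gaussian_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def floor_posterior(pressure, floor_params):
    """p(s_floor | y_floor), assuming an equal-weight mixture of Gaussians."""
    lik = {fl: gaussian_pdf(pressure, mu, sd)
           for fl, (mu, sd) in floor_params.items()}
    total = sum(lik.values())
    return {fl: v / total for fl, v in lik.items()}

def joint_state_posterior(pressure, act_posterior, floor_params):
    """p(s_i | y) = p(s_floor | y_floor) * p(s_act | y_act) for every state."""
    fp = floor_posterior(pressure, floor_params)
    return {(act, fl): pa * pf
            for act, pa in act_posterior.items()
            for fl, pf in fp.items()}

# Hypothetical per-floor pressure models: floor -> (mean, std) in hPa.
floor_params = {0: (1013.0, 0.1), 1: (1012.6, 0.1), 2: (1012.2, 0.1)}
# Hypothetical activity posterior from Platt-scaled SVM outputs.
act_posterior = {"walk": 0.7, "stair_up": 0.2, "stand_still": 0.1}

joint = joint_state_posterior(1012.62, act_posterior, floor_params)
best_state = max(joint, key=joint.get)
```

Each joint state is an (activity, floor) pair, which is what lets a single HMM infer both quantities at once.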

d) Post-Process Rectification
The HMM smooths the state transition because the probability of state change is much smaller than that of staying in the same state.  Thus, the number of sporadic misclassifications from the SVM may be reduced.  In addition, activity inference of the HMM can be further improved by rectifying activities of stairs that involve no floor change to walk and, likewise, elevators to stand still.
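The rectification rule can be sketched as follows, assuming the HMM output is available as two parallel per-window lists of activities and floors. The label names and the choice to compare the floor just before and just after each run are assumptions made for this illustration.

```python
from itertools import groupby

# Stairs without a floor change become walk; elevators become stand still.
RECTIFY = {
    "stair_up": "walk",
    "stair_down": "walk",
    "elevator_up": "stand_still",
    "elevator_down": "stand_still",
}

def rectify(activities, floors):
    """Relabel stair/elevator runs that did not change the inferred floor."""
    out = list(activities)
    n = len(activities)
    start = 0
    for act, run in groupby(activities):
        length = len(list(run))
        end = start + length - 1
        floor_before = floors[max(start - 1, 0)]
        floor_after = floors[min(end + 1, n - 1)]
        if act in RECTIFY and floor_before == floor_after:
            out[start:end + 1] = [RECTIFY[act]] * length
        start += length
    return out

activities = ["walk", "walk", "stair_up", "stair_up", "walk",
              "elevator_up", "elevator_up", "stand_still"]
floors = [1, 1, 1, 2, 2, 2, 2, 2]

rectified = rectify(activities, floors)
```

In this toy sequence the stair run is kept because the floor changes from 1 to 2, while the elevator run is relabeled as `stand_still` because the floor stays at 2.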

e) Multi-floor GraphSLAM with Organic Landmark
GraphSLAM is an approach that optimizes a trajectory by representing it as a graph of constraints between consecutive positions and by minimizing an error to satisfy the constraints specified by the graph. In order to obtain an accurate trajectory, GraphSLAM requires a good number of landmarks visited more than once.  Detailed explanation and formulation can be found in the tutorial [2].
In this paper, we focus on providing organic landmarks which are stairs and elevators detected when a pedestrian moves inside a building.  The identity of landmarks can be determined by comparing WiFi visibility signatures such as the MAC address of a WiFi access point.  On training, WiFi visibility and the physical location of all landmarks are obtained as a reference landmark list.  Then, when a landmark is detected on testing, we compare the current WiFi visibility to all landmarks and take the physical location of the closest landmark in the reference list.
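The landmark lookup can be sketched as a nearest-neighbor search over WiFi visibility sets. The text does not specify the similarity measure, so the Jaccard similarity used here is an assumption, as are the MAC addresses, landmark names and coordinates.

```python
def jaccard(a, b):
    """Overlap between two sets of visible access-point MAC addresses."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_landmark(visible, reference):
    """Return the name and location of the most similar reference landmark.

    reference: name -> (set of MAC addresses, (x, y, floor)).
    """
    name = max(reference, key=lambda k: jaccard(visible, reference[k][0]))
    return name, reference[name][1]

# Hypothetical reference landmark list collected during training.
reference = {
    "stairs_east": ({"aa:00:00:00:00:01", "aa:00:00:00:00:02",
                     "aa:00:00:00:00:03"}, (12.0, 3.5, 2)),
    "elevator_a": ({"aa:00:00:00:00:03", "aa:00:00:00:00:04",
                    "aa:00:00:00:00:05"}, (40.0, 8.0, 2)),
}

# WiFi scan taken when a stairs or elevator activity is detected at test time.
scan = {"aa:00:00:00:00:01", "aa:00:00:00:00:02", "aa:00:00:00:00:06"}
name, location = match_landmark(scan, reference)
```

The returned location can then be used as the position of a revisited landmark when building the GraphSLAM constraints.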

3. Experiments and Results
We experimented with the proposed method in a large, multi-floor office building with many stairs and elevators.  In our experiments, accelerometer, gyroscope and barometer data were recorded from Android smartphones at 50 Hz.  Training data, totaling 10271 seconds, were recorded by three subjects.  To help with annotation, the same action was performed repeatedly.  We defined 6 indoor activities: walking, taking stairs down, taking stairs up, standing still, taking elevator down and taking elevator up.  For test data, subjects walked inside the building naturally.  The test set comprises 12 sequences totaling 6160 seconds.

a) Quantitative analysis
Activity classification results for various models are shown in the following table.  Columns show class accuracies for SVM, HMM and rectification results, respectively.  The HMM inference obtained from the Viterbi algorithm improves over the SVM classification for all activities.  Figure 4 shows that the HMM reduces confusion among the locomotive activities of walk, stair down and stair up.  It also reduces misclassifications between stand still and the elevator activities (elevator down and elevator up).  Such sporadic misclassifications are suppressed by the temporal smoothing of the HMM.  Finally, post-processing of the HMM inference further rectifies walk activity that was misclassified as stairs.

b) Qualitative analysis
The following figure shows an example of the inference result.  The labels of activities are walking(WA), stair down(SD), stair up(SU), stand still(SS), elevator down(ED) and elevator up(EU) from bottom to top. In the given sequence, the user visited 7 floors including \(F0\) which is the basement.

We observe that the SVM confuses the locomotive activities of walking, stair down and stair up.  These misclassifications are corrected by the HMM Viterbi algorithm, which also infers the floor correctly.  Post-processing further relabels as walk the stairs down and stairs up activities that did not incur a floor change.

c) Multi-floor GraphSLAM
This example describes a trajectory that included visits to 4 floors.

The initial trajectory obtained from smartphone PDR contains drift.

The following video shows how GraphSLAM improves trajectory using organic landmarks obtained from the proposed framework.

4. Conclusion
In this paper, we propose a novel framework that jointly infers activity and floor landmarks.  Pose invariant features from inertial sensors are adopted for SVM-based activity classification.  We design an HMM where an activity on each floor defines a state.  State transitions are designed to provide temporal smoothness of a state sequence.  The probability of an observation is estimated from an activity class probability provided by the SVM classification, which is multiplied by the floor likelihood from a mixture-of-Gaussians model.  We further rectify activity inferences when we observe activities of stairs and elevators which do not incur floor changes.  Our experiments show that the proposed framework accurately classifies activities and infers floors.  Finally, we showed that the organic landmarks obtained from our framework can be applied effectively to enable multi-floor GraphSLAM.


  • [1] T. Kobayashi, K. Hasida, and N. Otsu, “Rotation invariant feature extraction from 3D acceleration signals,” in ICASSP, 2011
  • [2] G. Grisetti, R. Kummerle, C. Stachniss, and W. Burgard, “A tutorial on graph-based SLAM,” in Intelligent Transportation Systems Magazine, IEEE, 2010.

Full text: [pdf]

Presentation slide: [pptx]


Relevance Prediction of Image Labels

1. Overview

Most traditional image annotation approaches focus on tagging of images with labels: textual labels from a lexicon of words are either assigned or not assigned to an image based on its visual content or the content of similar images.  However, in many situations it may be advantageous to predict relevance of individual labels and not just their presence or absence.   Instead of inferring this relevance as a side-product of tagging, we propose an approach that directly focuses on label relevance prediction.  In the proposed setting, each label is assigned to a finite set of ordered relevance levels, akin to classical elicitation of relevance in user studies.  To induce dependencies between relevance predictions on related labels we extend the recently proposed Conditional Ordinal Random Field model to this image relevance assignment task.  Experiments on LabelMe and PASCAL VOC 2007 demonstrate the utility of the proposed model and potential advantages of the new relevance setting over traditional tagging.

2. Introduction


Figure 1: Prediction of relevance for four image labels. The second column represents the output of a traditional binary classification method. The probabilities in the fourth column are the estimates of relevance produced by the proposed method. Our proposed method predicts relevance of a label to the query image as shown in the third column. The highest relevance tag can be selected to predict the final relevance.  Red bars indicate ground truth relevance.

In this work we propose to formulate the image annotation problem as that of assigning relevance levels to possible image labels.  This is illustrated in Figure 1, where the labels are assigned to one of the four relevance levels: “not relevant”, “weakly relevant”, “relevant” and “highly relevant.”  Unlike in the classification setting, the relevance levels here are ordered:  “not relevant” < “weakly relevant” < “relevant” < “highly relevant.”  In other words,  “not relevant” is closer to “weakly relevant” than it is to  “highly relevant.”  Assignment of discrete levels, in contrast to continuous relevance scores, has several potential advantages. Assignment of discrete levels at training (human data annotation) time is feasible for human annotators and as easy as traditional absent/present label annotation. At prediction time (automated annotation) the use of relevance levels is appealing as it yields small sets of discrete and easy to understand rank categories, precluding the need to form those categories from continuous scales after the prediction stage.

We also propose a computational framework for learning and prediction of image label relevances based on the recently introduced Conditional Ordinal Random Field (CORF) model [1].  CORF directly generalizes traditional computational tools for image annotation, such as Conditional Random Fields (CRFs), to the relevance prediction setting while taking advantage of correlations between predictions of different labels, either based on language or image contexts.  CORF has shown excellent results in the context of dynamic ordinal regression and prediction of emotion expression levels in facial video sequences.  We show that CORF can be adapted to the problem of label relevance prediction in image annotation.  Our results on images from two datasets, LabelMe and PASCAL VOC 2007, demonstrate potential benefits of this approach.

3. Relevance Level Conditional Ordinal Random Field (RL-CORF)

In this section we suggest how the independent-label ordinal regression models can be extended to structured-label settings. This can be accomplished by merging the ordinal regression model with the traditional conditional random field (CRF) model. We describe how this task can be accomplished specifically in the context of image relevance annotation.

A typical node potential function of the CRF is replaced with the ordinal regression model:
\[ \sum_{r \in V} \mathbf{v}^{\top} \Psi_r^{(V)}(\mathbf{x}, y_r) \rightarrow \Gamma_r^{(V)}(\mathbf{x}, y_r; \mathbf{w},\mathbf{b}, \sigma). \]
Then we have
\[ \Gamma_r^{(V)}(\mathbf{x}, y_r) = \sum_{c=1}^R I(y_r = c)\log \left( \Phi\left(\frac{b_c - f}{\sigma}\right) - \Phi\left(\frac{b_{c-1} - f}{\sigma}\right)\right), \]
where \(f = \mathbf{w}^T \cdot \phi(\mathbf{x})\) and \(\phi(\mathbf{x})\) is a node feature function which maps the input \(\mathbf{x}\) to a feature vector.
Finally, we have the RL-CORF model:
\[ p(\mathbf{y}|\mathbf{x}) \propto \exp\left( \sum_{r \in V} \Gamma_r^{(V)}(\mathbf{x}, y_r)  + \sum_{e = (r,s) \in E} \mathbf{u}^T\Psi^{(E)}(\mathbf{x},  y_r, y_s) \right). \]
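To make the model concrete, the following sketch evaluates \(p(\mathbf{y}|\mathbf{x})\) by brute force on a tiny chain of three labels with four relevance levels. It is an illustration only. The edge term \(-u\,|y_r - y_s|\) is a hypothetical stand-in for \(\mathbf{u}^T\Psi^{(E)}\), and the thresholds, node scores and noise level are made-up values.

```python
import itertools
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def node_potential(f, level, cuts, sigma):
    """Gamma for a 1-based level c, with b_0 = -inf and b_R = +inf."""
    b = [-math.inf] + list(cuts) + [math.inf]
    return math.log(Phi((b[level] - f) / sigma) - Phi((b[level - 1] - f) / sigma))

def corf_posterior(fs, edges, cuts, sigma, u):
    """Enumerate all label configurations and normalize exp(node + edge)."""
    R = len(cuts) + 1
    weights = {}
    for y in itertools.product(range(1, R + 1), repeat=len(fs)):
        s = sum(node_potential(f, c, cuts, sigma) for f, c in zip(fs, y))
        # Hypothetical edge potential: neighbors prefer similar levels.
        s += sum(-u * abs(y[r] - y[t]) for r, t in edges)
        weights[y] = math.exp(s)
    Z = sum(weights.values())
    return {y: w / Z for y, w in weights.items()}

cuts = [-1.0, 0.0, 1.0]      # b_1, b_2, b_3 for R = 4 levels
fs = [1.8, 1.5, 0.1]         # node scores f = w^T phi(x), one per label
edges = [(0, 1), (1, 2)]     # a chain, which is a simple tree

independent = corf_posterior(fs, edges, cuts, sigma=1.0, u=0.0)
coupled = corf_posterior(fs, edges, cuts, sigma=1.0, u=0.5)
map_independent = max(independent, key=independent.get)
map_coupled = max(coupled, key=coupled.get)
```

With `u = 0` the model reduces to independent ordinal regressors. A positive `u` lets confident neighbors pull an ambiguous label toward their level. Exhaustive enumeration is only feasible for toy graphs; on tree structures, exact inference is normally done with message passing instead.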

4. Experimental Results

We evaluate our model in two sets of experiments. First, we consider binary classification to verify the effect of RL-CORF on the LabelMe data. In this experiment, we contrast the performance of our model with a Support Vector Machine (SVM) model. Next, we empirically demonstrate the performance of RL-CORF on predicting the relevance levels. Support Vector Ordinal Regression (SVOR) and Conditional Random Field (CRF) are used to set two baselines in this setting.

Table 1: Results of the Binary Classification for LabelMe

          SVM     RL-CORF
AUC-PR    0.42    0.45

Table 1 shows the results of the binary classification. As one can see, RL-CORF performs better than SVM. Modeling dependencies among labels using a graphical structure leads to improved binary prediction (tagging) performance.

Table 2: Results of Relevance Level Prediction (VUS scores)

LabelMe 0.056 0.057 0.071
PASCAL VOC 2007 0.039 0.036 0.054

We use the volume under a 4-dimensional surface (VUS) as the evaluation measure for relevance prediction. The VUS is a generalization of the AUC and its underlying probabilistic interpretation to ordinal regression problems.
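One common way to read this measure, and the one assumed in the sketch below, is the fraction of tuples containing one item from each relevance level whose predicted scores appear in the correct order. For two levels this reduces to the usual AUC. The scores and levels are toy values, and the function name is hypothetical.

```python
import itertools

def vus(scores, levels, num_levels):
    """Fraction of one-item-per-level tuples ranked in increasing order."""
    by_level = [[s for s, l in zip(scores, levels) if l == c]
                for c in range(1, num_levels + 1)]
    if any(not group for group in by_level):
        raise ValueError("every level needs at least one item")
    total = correct = 0
    for combo in itertools.product(*by_level):
        total += 1
        correct += all(a < b for a, b in zip(combo, combo[1:]))
    return correct / total

levels = [1, 1, 2, 2, 3, 3, 4, 4]
perfect_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
noisy_scores = [0.1, 0.2, 0.3, 0.45, 0.4, 0.6, 0.7, 0.9]  # one 2/3 swap

vus_perfect = vus(perfect_scores, levels, 4)  # 1.0
vus_noisy = vus(noisy_scores, levels, 4)      # 0.75
```

Under this reading, a random scorer on four levels would be expected to order a tuple correctly with probability \(1/4! \approx 0.042\), which gives a rough reference point for interpreting small VUS values.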

Table 2 shows the prediction results of the relevance levels for LabelMe and PASCAL VOC 2007. Higher VUS score values imply better performance. One can see that RL-CORF outperforms CRF and SVOR with at least 24% higher VUS values.

Figures 2, 3, 4 and 5 depict prediction results for arbitrarily chosen images from LabelMe and PASCAL VOC 2007. We take the labels for which SVOR, CRF, and RL-CORF produce different predictions. Relevance levels for all other labels not depicted are identical for all three models. The probabilities in the second, third and fourth columns are the relevance estimates produced by SVOR, CRF and RL-CORF, respectively. They correspond to levels “not relevant”, “weakly relevant”, “relevant” and “highly relevant” from left to right. Red bars indicate ground truth relevance.

CRF uses the same graphical structure as RL-CORF. However, because CRF ignores the ordinal scale of the outputs and treats all levels as equally distinct categories, it often fails to estimate accurate label relevances. On the other hand, RL-CORF predicts the correct relevance levels where SVOR produces inferior estimates, which can be attributed to the effective collaboration among labels through the tree structure in RL-CORF.

Figure 2: Prediction of relevance levels for an image in LabelMe. RL-CORF, CRF and SVOR predict the relevance level correctly for 8, 2, and 1 label, respectively.

Figure 3: Prediction of relevance levels for an image in LabelMe. RL-CORF, CRF and SVOR predict the relevance level correctly for 10, 2, and 7 labels, respectively.

Figure 4: Prediction of relevance levels for an image in the PASCAL VOC 2007. RL-CORF, CRF and SVOR predict correct relevance for 6, 0, and 4 labels, respectively.

Figure 5: Prediction of relevance levels for an image in the PASCAL VOC 2007. RL-CORF, CRF and SVOR predict correct relevance for 10, 1, and 8 labels, respectively.

5. Conclusion

We proposed a new task of assigning image labels to ordinal relevance categories, a natural extension of traditional tagging. Our computational formalism is based on the structured ordinal regression method which estimates the relevance levels effectively on the ordinal scales while exploiting possible dependencies among labels, conditioned on the image context. Our experiments show that the proposed model outperforms competing methods including SVM, SVOR and CRF both in binary classification and the prediction of label relevance levels.


  • [1] M. Kim and V. Pavlovic. “Structured output ordinal regression for dynamic facial emotion intensity prediction”. Computer Vision – ECCV 2010. K. Daniilidis, P. Maragos, and N. Paragios, eds. 2010. pp. 649-662.

Attribute Rating for Classification of Visual Objects

1. Overview

Traditional visual classification approaches focus on predicting absence/presence of labels or attributes for images. However, it is sometimes useful to predict the ratings of the labels or attributes endowed with an ordinal scale (e.g., “very important,” “important” or “not important”). The ordinal scale representation allows us to describe object classes more precisely than simple binary tagging. In this work, we propose a new method where each label/attribute can be assigned to a finite set of ordered ratings, from most to least relevant. Object classes are then predicted using these ratings. Experiments on the Animals with Attributes dataset demonstrate the performance of the proposed method and show its advantages over previous methods based on binary tagging and multi-class classification.

2. Introduction

Attributes, such as color, shape, or lightness, are useful for succinct and intuitive characterization of objects. Ratings of labels can be particularly useful when dealing with object attributes. For instance, as illustrated in Figure 1, attribute “spot” is irrelevant to describing a polar bear. On the other hand, “coastal” is highly relevant because it is a habitat of the polar bear. Similarly, “blue” and “swim” can be deemed less relevant and relevant attributes, respectively, to represent the polar bear. Traditional binary tagging (presence/absence) of attributes makes the two classes in Figure 1 indistinguishable. On the other hand, rating each attribute according to four ratings may lead to disambiguation based on the different rating depiction of each object class.

In this work, we propose a method to formulate the multi-class classification problem as that of assigning ordinal ratings in terms of relevance to attributes, in contrast to other works that consider binary tagging. We use a probabilistic ordinal regression to infer the ordinal ratings for the attributes. Experiments on recognizing animal classes on the Animals with Attributes (AwA) dataset [1] show potential benefits of the proposed method.

Figure 1: Binary tagging vs. ordinal ratings for attributes. Two classes are indistinguishable in traditional binary tagging. On the other hand, rating each attribute according to four ratings makes them distinguishable from each other.

3. Direct Attribute Prediction (DAP)

In this section we briefly present the direct attribute prediction (DAP) method proposed in [1], generalized to R attribute ratings. A probability of a class \(z\) for a given input \(\mathbf{x}\) is defined as:
\[ P(z|\mathbf{x}) = P(z) \prod_{m=1}^M \frac{P(a_m^z|\mathbf{x})}{P(a_m^z)}, \qquad \qquad (1)\]
where \(P(z)\) is the class prior, \(P(a_m^z)\) is the attribute prior and \(P(a_m^z|\mathbf{x})\) is the image-attribute probability that we learn during training. Please refer to [1] for details.
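Equation (1) can be sketched in a few lines. The example below works in the log domain and uses a hypothetical attribute table with two classes, three attributes and four ratings indexed 0 to 3. The uniform priors and all probability values are assumptions for illustration.

```python
import math

def dap_log_scores(attr_probs, class_table, attr_prior, class_prior):
    """log P(z|x) up to a constant, following Equation (1).

    attr_probs[m][r]: P(a_m = r | x) for the input image.
    class_table[z][m]: rating that class z assigns to attribute m.
    attr_prior[m][r]: prior probability of rating r for attribute m.
    """
    scores = {}
    for z, ratings in class_table.items():
        log_s = math.log(class_prior[z])
        for m, r in enumerate(ratings):
            log_s += math.log(attr_probs[m][r]) - math.log(attr_prior[m][r])
        scores[z] = log_s
    return scores

# Hypothetical attribute table: ratings for (spot, coastal, swim).
class_table = {"polar_bear": [0, 3, 2], "leopard": [3, 0, 1]}
class_prior = {"polar_bear": 0.5, "leopard": 0.5}
attr_prior = [[0.25] * 4 for _ in range(3)]

# Hypothetical per-attribute rating probabilities predicted for one image.
attr_probs = [
    [0.7, 0.1, 0.1, 0.1],  # spot: most likely rating 0
    [0.1, 0.1, 0.2, 0.6],  # coastal: most likely rating 3
    [0.1, 0.2, 0.5, 0.2],  # swim: most likely rating 2
]

scores = dap_log_scores(attr_probs, class_table, attr_prior, class_prior)
predicted = max(scores, key=scores.get)
```

Because the class only enters through its row of the attribute table, the same code applies to classes that have no training images, as long as their attribute ratings are known.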

4.  Probabilistic Model for Ordinal Regression

4.1. Ordinal Regression Model

In an ideal, noise-free setting, an ordinal regression strategy can be interpreted in the following probabilistic setting
\[ P_{ideal} (y=c|\mathbf{x}) = \left\{\begin{array}{ll} 1 & \textrm{if } g(\mathbf{x}) \in (b_{c-1}, b_c]\\ 0 & \textrm{otherwise}\end{array} \right. ,\]
where \(g(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})\), \(\phi(\mathbf{x})\) is the feature function which projects the image \(\mathbf{x}\) into the feature space and \(-\infty = b_0 \leq b_1 \leq b_2 \leq  \cdots \leq b_R = \infty\). When we add the Gaussian noise \(\delta \sim N(\delta; 0, \sigma^2)\), the posterior probability becomes
\[ \begin{align*} P(y=c|\mathbf{x}) &= \int_{\delta} P_{ideal}(y=c|g(\mathbf{x})+\delta) \cdot N(\delta;0,\sigma^2)\,d\delta \\ &= \Phi\left(\frac{b_c - g(\mathbf{x})}{\sigma}\right) - \Phi\left(\frac{b_{c-1} - g(\mathbf{x})}{\sigma}\right), \end{align*} \]
where \(\Phi(\cdot)\) is the standard normal cdf. We call this method normal cumulative density function scaling (NCDFS).
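The closed form above is straightforward to evaluate. The sketch below computes the rating probabilities for a single score \(g(\mathbf{x})\), using the standard library error function for \(\Phi\). The thresholds, score and noise levels are illustrative assumptions.

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ncdfs_probs(g, cuts, sigma):
    """P(y = c | x) for c = 1..R, with b_0 = -inf and b_R = +inf."""
    b = [-math.inf] + list(cuts) + [math.inf]
    return [Phi((b[c] - g) / sigma) - Phi((b[c - 1] - g) / sigma)
            for c in range(1, len(b))]

cuts = [-1.0, 0.0, 1.0]  # b_1, b_2, b_3 for R = 4 ratings
probs = ncdfs_probs(0.4, cuts, sigma=0.5)
sharp = ncdfs_probs(0.4, cuts, sigma=1e-3)
best_rating = 1 + max(range(len(probs)), key=probs.__getitem__)
```

The probabilities always sum to one because the differences telescope. As \(\sigma\) shrinks, the distribution approaches the ideal noise-free indicator for the interval that contains \(g(\mathbf{x})\).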

4.2. Normal Cumulative Density Function Scaling (NCDFS)

To find the optimal parameters of NCDFS for our purpose, we minimize the following negative log-likelihood function
\[ \begin{align*} \mathcal{L} &= -\sum_{i=1}^N \sum_{m=1}^M \log P(a_m^{y_i}|\mathbf{x}_i) \\ &= -\sum_{i=1}^N \sum_{m=1}^M\sum_{c=1}^{R} I(a_m^{y_i} = c) \cdot \log \left( \Phi\left(\frac{b_c - g_m(\mathbf{x}_i)}{\sigma}\right) - \Phi \left(\frac{b_{c-1} - g_m(\mathbf{x}_i)}{\sigma}\right)\right), \end{align*} \]
where \(N\) is the number of images and \(y_i\) is the class label of the input \(\mathbf{x}_i\).

5. Data Set

We examine the performance of the proposed method on Animals with Attributes dataset (AwA) [1]. AwA contains 30,475 images, 50 animal classes and 85 attributes. We split the images into training and test data as described in [1]. The test data contains 10 classes: ‘chimpanzee’, ‘giant panda’, ‘hippopotamus’, ‘humpback whale’, ‘leopard’, ‘pig’, ‘raccoon’, ‘rat’, ‘persian cat’ and ‘seal’, consisting of 6,180 images. The training data contains 24,295 images of the other 40 classes.

We use 6 visual features provided by [1]: RGB color histograms, SIFT, rgSIFT, PHOG, SURF and local self-similarity histograms. An image is represented by concatenating the 6 visual features into a 10,940-dimensional vector.

We map the attributes to four ratings, “irrelevant” < “less relevant” < “relevant” < “highly relevant.” These ratings were chosen because they show the best performance. We will investigate how the number of rating values impacts prediction accuracy.

6. Results

6.1 NCDFS settings

For the feature function, we set \(\phi(\mathbf{x}) = [1, f(\mathbf{x})]^\top \), where \(f(\mathbf{x})\) is the output of the Support Vector Ordinal Regressor (SVOR) for \(\mathbf{x}\). Using the SVOR output is computationally advantageous because it is faster to work with than the raw visual features.

We first predict the probability \(P_m(r|\mathbf{x})\) of rating \(r\) for the \(m\)-th attribute given the input image \(\mathbf{x}\) using NCDFS. We then predict the class of \(\mathbf{x}\) as the one that maximizes Equation (1), using the attribute-to-class mapping (attribute table).

To set the baseline performance, we use a multi-class SVM (mSVM). mSVM takes the same input as SVOR. However, mSVM treats all levels (ratings) as unordered categories, in contrast to SVOR, where the levels are ordered.

6.2 Results

Table 1: Average accuracy for detecting animal classes for [1], mSVM and NCDFS. The multi-class accuracy is measured by the mean of the diagonal of the confusion matrix. One can find that NCDFS outperforms the other methods.

Table 2: Area Under Curve (AUC) of 10 test classes for [1], mSVM and NCDFS (%).

Figure 2: Confusion matrices between 10 test classes of the AwA dataset for [1], mSVM and NCDFS.

Figure 3: Classification accuracies for various numbers of ratings. Figure 3 indicates that the proposed 4-rating scale is optimal for the AwA classification problem based on ordinal regression.


  • [1] C. H. Lampert, H. Nickisch, and S. Harmeling. “Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer”. In CVPR, 2009


  • Attribute Rating for Classification of Visual Objects, Jongpil Kim and Vladimir Pavlovic, International Conference on Pattern Recognition (ICPR), 2012. [pdf]

Discovering Characteristic Landmarks on Ancient Coins

1. Goal of the Project
For a given Roman coin image, the goal is to

  1. automatically find the visual characteristics of the coin that make it distinguishable from the others
  2. identify the Roman Imperial Coinage (RIC) label of the coin

Figure 1. Sample obverse and reverse images of two ancient Roman imperial coins. Both coins depict the same emperor (Domitian) on the obverse side but have distinct reverse depictions, resulting in different Roman Imperial Coinage (RIC) labels. The descriptions for them are (a) obverse: Laureate head right, Reverse: Minerva standing right on capital of rostral column with spear and shield to right owl, and (b) obverse: Laureate head right, Reverse: Pegasus right.

2. Motivation and Practical Applications

  • Understanding the ancient Roman coins could serve as references to understand the Roman empire
  • A reliable and automatic method to recognize the coins is necessary as the coin market is very active and many people collect coins as a hobby
  • Ancient coins are becoming subject to a very large illicit trade. Recognizing ancient Roman coins requires expert knowledge and is not easy for novices.

3. Coin Classification Using Convolutional Neural Networks (CNNs)

  • Fine-tuning on pre-trained CNN models

    – Considering the number of the coin images in our dataset (about 4500), the CNN model is likely to be under-fitted if we train it only on the coin dataset even if we use the data augmentation method
    – We adopt one of the most popular architectures, proposed by Krizhevsky et al. [2], which is pre-trained on ImageNet with millions of natural images

  • Hierarchical classification

    – One Roman emperor includes several RIC labels as shown in Figure 1 while one RIC label belongs to exactly one emperor. Therefore, we can build a tree structure to represent the relationship between the Roman emperors and the RIC labels as depicted in Figure 2
    – Then the final probability is defined to be the product of the probabilities on the path from the root to the leaf as \[p(e|I_o) \cdot p(r|I_r) \cdot \delta(Pa(r) = e)\] where \(Pa(r)\) is the parent of node \(r\) and \(\delta(\cdot)\) is the indicator function

    Figure 2. Hierarchical classification for the RIC label
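The hierarchical scoring rule can be sketched as below, assuming \(I_o\) and \(I_r\) denote the obverse and reverse images of a coin. The RIC label names and all probability values are hypothetical and stand in for the softmax outputs of the two CNNs.

```python
def hierarchical_predict(p_emperor, p_ric, parent):
    """Score each RIC label r by p(e | I_o) * p(r | I_r) with e = Pa(r).

    The indicator delta(Pa(r) = e) removes every other (e, r) pair, so it
    is enough to look up the parent emperor of each RIC label.
    """
    scores = {r: p_emperor[parent[r]] * p_ric[r] for r in p_ric}
    best = max(scores, key=scores.get)
    return best, parent[best], scores

# Hypothetical tree: two RIC labels under Domitian, one under Trajan.
parent = {"RIC_A": "Domitian", "RIC_B": "Domitian", "RIC_C": "Trajan"}

# Hypothetical classifier outputs for one coin.
p_emperor = {"Domitian": 0.9, "Trajan": 0.1}          # from the obverse image
p_ric = {"RIC_A": 0.35, "RIC_B": 0.25, "RIC_C": 0.40}  # from the reverse image

ric_label, emperor, scores = hierarchical_predict(p_emperor, p_ric, parent)
```

In this toy case the reverse image alone would favor `RIC_C`, but the confident emperor prediction from the obverse image shifts the final decision to `RIC_A`.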

4. Finding Characteristic Landmarks on Roman Coins

  • Characteristic region set: the smallest set of local patches to represent the identity of the full image
  • Representation of the masked image
  • \[ f_\mathbf{I}(\mathbf{x}) = \left(\sum_{k=1}^{K} x_k \cdot \mathbf{I}_k\right) \otimes \mathbf{C} \] where \(\mathbf{I}\) is an image, \(\mathbf{I}_k\) the k-th subregion of image \(\mathbf{I}\), \(x_k \,\,(0 \leq x_k \leq 1)\) a transparency weight for the k-th subregion (0: transparent, 1: full intensity), \(\mathbf{C}\) a normalization vector

  • Objective function: finding an image which consists of the smallest set of regions but still can be correctly classified
  • \[ \begin{align*} \min_\mathbf{x} & \quad \ell_c\left(f_\mathbf{I} (\mathbf{x})\right) + \lambda \mathcal{R}(\mathbf{x}) \\ \mbox{s.t.} & \quad p\left(c|f_\mathbf{I}(\mathbf{x})\right) > p\left(c|f_\mathbf{I}(\mathbf{1})\right) - \epsilon,\\ & \quad \epsilon > 0, \end{align*} \] where \(\ell_c(\cdot)\) is the loss function of the CNN model, \(\mathcal{R}(\mathbf{x})\) a regularization function and \(\lambda\) a hyperparameter that controls the regularization. The constraint ensures that the prediction probability of the masked image \(f_\mathbf{I}(\mathbf{x})\) drops below that of the original image \(f_\mathbf{I}(\mathbf{1})\) by less than \(\epsilon\)

5. Coin Data Set

  • Coin images are collected from a numismatic website
  • 4526 Roman Imperial coins with 314 RIC labels and 96 Roman emperors
    – Annotated for visual analysis (the original dataset only has numismatic annotation)
    – Both obverse and reverse images for each coin
    – Each emperor has at least 10 coins

  • High resolution images : at least 300-by-300 pixels

6. Experimental Results

  • Classification
    – Baseline method: SVM

    Table 1. Classification Accuracies for SVM and CNN

    Figure 3. Confusion matrices of CNN and SVM

  • Finding Landmarks
  • Figure 4. Visualization of landmarks as a function of \(\epsilon\)

    Figure 5. Discovered landmarks for obverse and reverse images

7. Conclusion

In this project, we proposed a novel method to discover the characteristic landmarks of the ancient Roman imperial coins. Our method automatically finds the smallest set of the discriminative regions sufficient to represent the identity of the full image and distinguish it from other available classes.

The qualitative analysis of the visualization of the discovered regions confirms that the proposed method not only finds the class-specific regions effectively but is also consistent with the human expert annotations. The proposed framework to identify the ancient Roman imperial coins outperforms the previous approach in the domain of coin classification by using the hierarchical structure of the RIC labels.


  • [1] J. Kim and V. Pavlovic. “Discovering Characteristic Landmarks on Ancient Coins using Convolutional Networks”. 2015. Available:
  • [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”. NIPS. 2012