Problem Definition
Financial decisions about investing in industry-based indices are often driven by heuristics and non-standard methods, or by company-specific algorithmic methods alone. In-depth analysis of historic stock market behavior and of the dynamics among different industries is critical for predicting future trading outcomes. It is also important to identify which companies’ stock prices lead or lag those of other companies and which follow similar trends, so that we can find groups of companies that behave similarly over a given time period and make better investment decisions.
Our Approaches
We are working on methods that extend time series analysis of historic stock market data to identify groups of companies with similar patterns or behavior, and to verify how well those groups agree with the GICS (Global Industry Classification Standard) sectors. The GICS sectors are Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunication Services and Utilities. We have worked with a variety of machine learning and probabilistic methods, as described below.
- Finding similarities between time series sequences using string kernel matching
The time series of historic stock prices are converted, using a sliding-window approach, into string sequences over a finite alphabet. We then use the method proposed by Kuksa et al. [1] to measure the similarity between string sequences using mismatch kernels. Here we use only a local mismatch kernel, which measures the similarity between pairs of substrings that fall within a specified time lag (for example, within 2 weeks), unlike the global mismatch kernel, which compares all possible pairs of substrings. We restrict ourselves to the local kernel because, in the financial domain, the influence of one time series on another is short-lived: the stock market is efficient, so longer-term effects should be minimal. We then cluster the companies/tickers with the Affinity Propagation algorithm to find those with similar patterns, and check how well the clusters agree with the GICS classification. One set of results we have obtained is shown in the following table.
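The discretization and lag-restricted matching steps can be sketched as follows. This is a minimal illustration and not the algorithm of [1]: the three-letter alphabet (U, D, F), the return threshold, and the simplified similarity (a count of k-mer pairs within a Hamming distance, restricted to a maximum lag) are all assumptions made for the example.

```python
def discretize(prices, threshold=0.005):
    """Map a price series to a string over the assumed alphabet {U, D, F}.

    U = return above +threshold, D = return below -threshold, F = flat.
    """
    symbols = []
    for prev, cur in zip(prices, prices[1:]):
        ret = (cur - prev) / prev
        if ret > threshold:
            symbols.append("U")
        elif ret < -threshold:
            symbols.append("D")
        else:
            symbols.append("F")
    return "".join(symbols)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def local_mismatch_similarity(s, t, k=3, m=1, max_lag=2):
    """Count pairs of k-mers (one from s, one from t) that differ in at most
    m positions and whose start positions are at most max_lag apart.

    Simplified stand-in for a local mismatch kernel (assumption).
    """
    count = 0
    for i in range(len(s) - k + 1):
        lo = max(0, i - max_lag)
        hi = min(len(t) - k, i + max_lag)
        for j in range(lo, hi + 1):
            if hamming(s[i:i + k], t[j:j + k]) <= m:
                count += 1
    return count


def similarity_matrix(strings, **kwargs):
    """Pairwise similarities for a list of discretized series."""
    return [[local_mismatch_similarity(a, b, **kwargs) for b in strings]
            for a in strings]
```

A matrix built this way could be passed to a clustering routine that accepts precomputed similarities, for example scikit-learn's `AffinityPropagation` with `affinity="precomputed"`.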

- Granger Causality based analysis
Granger causality is a statistical technique, introduced by Nobel prize winner Clive Granger, for testing whether one time series has a causal relationship with another. We used this hypothesis test to examine how the historic time series of one company/ticker is affected by the lagged time series of other companies/tickers. We performed the analysis for companies/tickers within and between industry sectors (as defined by the GICS classification) at several statistical significance thresholds, and analyzed how the resulting causal graphs vary over time. This analysis led to the idea of looking at time-varying graphs [2], which identify how the causal relationships between companies/tickers change over time. The following graph shows the number of Granger-causal links within each industry sector for a specific time period.
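A pairwise Granger test of this kind compares a regression of a series on its own lags with one that also includes the lags of the other series. The sketch below computes only the F statistic using NumPy least squares; the function names and the default lag order are assumptions for illustration, and converting the statistic to a p-value would additionally need the F distribution with (p, n - 2p - 1) degrees of freedom.

```python
import numpy as np


def lagged(series, p):
    """Matrix whose columns are lags 1..p of series, aligned with series[p:]."""
    n = len(series)
    return np.column_stack([series[p - k:n - k] for k in range(1, p + 1)])


def granger_f_stat(x, y, p=2):
    """F statistic for H0: lags of x do not help predict y beyond y's own lags."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[p:]
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, lagged(y, p)])         # own lags only
    unrestricted = np.hstack([restricted, lagged(x, p)])  # plus lags of x

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / p) / (rss_u / dof)
```

Running the test for every ordered pair of tickers and keeping the pairs whose statistic exceeds the chosen significance threshold would give one directed graph per time window.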
- Sparse Regression based analysis
We have also used Lasso regression to model the linear relationship between a single ticker/company’s time series and the time series of the other tickers/companies. This analysis gives a sparse representation of the relationships between tickers within the same sector and between different sectors. The following bar graph shows the distribution of within-sector and between-sector links obtained from Lasso regression, with the penalty parameter chosen on a validation set of time sequences.
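To make the sparse-regression step concrete, here is a minimal coordinate-descent Lasso in NumPy. It is a sketch under assumptions: the target `y` is one ticker's (centered) return series, the columns of `X` are the other tickers' series, no intercept is fitted, and the objective is (1/2n)·||y − Xβ||² + α·||β||₁. The solver actually used in the project is not stated in the text.

```python
import numpy as np


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; exactly zero inside [-penalty, penalty]."""
    return np.sign(value) * max(abs(value) - penalty, 0.0)


def lasso_coordinate_descent(X, y, alpha=0.1, n_iter=200):
    """Minimize (1/2n)||y - X b||^2 + alpha * ||b||_1 by cyclic coordinate descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    beta = np.zeros(d)
    col_norm = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            if col_norm[j] == 0.0:
                continue  # constant-zero column carries no information
            # Residual with the contribution of feature j added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, alpha) / col_norm[j]
    return beta
```

The nonzero entries of `beta` are the links kept for the target ticker. A larger `alpha` gives a sparser graph, which is why the penalty is tuned on held-out sequences.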
- Random Graph based analysis
We are also interested in modelling the relationships between time series using random graphs. We have run some experiments with Exponential Random Graph Models (ERGMs), describing these financial time series through a set of network statistics such as density, number of mutual edges, and number of triangles, which indirectly control the structure of the graphs and how sparse or dense they are. This type of analysis can also be used to model how the graphs, their parameters, and their structure change over time, leading to dynamic graph analysis methods that could discover important links between companies and how those links evolve.
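The network statistics named above can be computed for a directed graph as in the sketch below. It only evaluates the statistics and does not fit an ERGM. The function name and edge-list representation are assumptions, and triangles are counted with edge direction ignored, which is one of several conventions.

```python
from itertools import combinations


def graph_statistics(nodes, edges):
    """Density, mutual-edge count and triangle count for a directed graph.

    nodes: iterable of hashable node labels (e.g. tickers)
    edges: iterable of (source, target) pairs; self-loops are ignored
    """
    nodes = list(nodes)
    edge_set = {(u, v) for u, v in edges if u != v}
    n = len(nodes)
    density = len(edge_set) / (n * (n - 1)) if n > 1 else 0.0
    # Each reciprocated pair appears twice in edge_set, hence the division.
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set) // 2
    # Triangles are counted on the undirected skeleton (assumption).
    undirected = {frozenset(e) for e in edge_set}
    triangles = sum(
        1
        for a, b, c in combinations(nodes, 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= undirected
    )
    return {"density": density, "mutual": mutual, "triangles": triangles}
```

Evaluating these statistics on the graph from each time window gives a simple view of how the network structure changes over time.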
References
[1] Kuksa, Huang & Pavlovic, Scalable Algorithms for String Kernels with Inexact Matching, Neural Information Processing Systems (NIPS), 2008
[2] Kolar, Song, Ahmed & Xing, Estimating Time-Varying Networks, The Annals of Applied Statistics, Vol. 4, No. 1, 94–123, 2010