Data Science for Global Equity Research


Many global financial advisory providers are moving towards a common analytics platform that supports researchers in the core business of making investment decisions for their financial customers. In general, an equity research platform must model performance and risk for several thousand global securities across multiple dimensions of analysis and time.

In June 2017, Xcalar successfully completed a project for the Analytics Research Technology division of a large financial services firm that manages several global equity benchmarks.

Traditionally, each line-of-business team at the firm maintained its own analytics tools that accessed production data from local replicas. This made it difficult to provide a common set of analysis and reporting services. Although several products had been evaluated, none were able to satisfy the disparate requirements of the various personas involved:
  • Data Scientists
    • Use Matlab on small client machines for prototyping, but only a small subset of the financial data sets can be worked on.
    • Want full control over the algorithm, and want to work on full-volume data at the source rather than choosing what to upload.
    • Have 10 or more concurrent ideas to explore, but can work through only 2 of them in Matlab. When they try Spark/Scala, the learning curve is so steep that they can keep track of no more than 2 ideas.
  • Developers
    • Write programs in R, Python or Scala to promote models into batch jobs that can be provided as reports that run periodically or on-demand. These are the reports that non-technical users rely on to deliver products.
    • Turnaround time for fixing and re-running a model with Hadoop/Spark jobs can be days, making iterations very expensive and limiting the ability to innovate.
  • Product Managers
    • Want push-button operations to run reports and get insights. Do not want any programming.
  • Vast spectrum of users between Data Scientists and Product Managers, who are dissatisfied with current offerings but wary of losing existing functionality.

The Problem: 25-Year Daily Earnings-Price Ratio for Global Securities

Xcalar was tasked with calculating the Earnings-Price (E/P) ratio, a fundamental descriptor in equity markets. The ratio can be expressed in various ways: as earnings per share divided by price, with or without float adjustment, and sampled per year, per quarter or point-in-time. The analytics research technology team used the following E/P definition for their equity research platform:
Earnings-Price (E/P) = Latest Year-End Earnings in USD / Current Market Capitalization in USD
To compute E/P, Xcalar analyzed 3 publicly available datasets:
  1. Earnings: 12-Month Year-End Earnings
    [Security_ID, 12_Month_Earnings, Currency_ID, Business_Date, Publication_Date]
    • Year-End earnings for global securities, quoted in local currency.
    • Business_Date is the fiscal period reporting date for the company.
    • Publication_Date is when earnings were actually published to the market.
    • Time period is about 25 years: Jan 1992 to May 2017.
  2. Market Cap: Daily Market Capitalization in US Dollars
    [Security_ID, Business_Date, Value, Currency_ID]
    • Daily market capitalization value for global securities, quoted in US dollars. The time period is about 52 years (Jan 1965 to May 2017).
  3. Foreign Exchange: Daily Currency Conversion to USD
    • Daily rates over 25 years for all global currencies.
The analysis is made complex by the inclusion of multiple time-series dimensions. In addition, the reporting frequencies for Earnings and Market Cap differ, so a simple join operation will not work. Here are some elements that the solution had to account for:
  • Earnings observations are point in time, but income statements can be restated multiple times. For example, AT&T may issue results for fiscal year 2016 in January 2017 and then a more accurate version a month later. The market consumes both results as they are published. For each analysis date, the most recently published earnings observation as of that date must be used.
  • The required analysis dates for each E/P result for each stock span all business days from January 1992 to May 2017.
  • Income statements and market capitalization are issued in different currencies.
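The currency step and the E/P definition above can be sketched in a few lines of Python. This is a minimal illustration, not the customer's Matlab logic or an Xcalar dataflow. The FX table layout, the rate convention (USD per one unit of local currency), the choice of conversion date, and the names fx_to_usd, earnings_in_usd and earnings_price are all assumptions made for the example; the sample figures are invented.

```python
from datetime import date

# Hypothetical FX table: (currency_id, business_date) -> USD per 1 unit of
# local currency. The real Foreign Exchange dataset layout is not specified.
fx_to_usd = {
    ("EUR", date(2017, 3, 1)): 1.25,  # invented rate, for illustration only
    ("USD", date(2017, 3, 1)): 1.0,
}

def earnings_in_usd(earnings_local, currency_id, on_date, fx):
    """Convert year-end earnings from local currency to USD.

    Which date's rate to use is an assumption here; the caller chooses it.
    """
    return earnings_local * fx[(currency_id, on_date)]

def earnings_price(earnings_usd, market_cap_usd):
    """E/P = latest year-end earnings in USD / current market cap in USD."""
    if market_cap_usd <= 0:
        return None  # ratio is undefined without a positive market cap
    return earnings_usd / market_cap_usd

usd_earnings = earnings_in_usd(200.0, "EUR", date(2017, 3, 1), fx_to_usd)
ep = earnings_price(usd_earnings, 5000.0)
```

Keeping the conversion separate from the ratio makes it easy to swap in a different rate date if the business rule calls for one.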

The Xcalar Solution: High Performance Fuzzy Joins between multiple Time-Series dimensions

Data scientists had implemented the business rules for E/P using Matlab subroutines. Our approach was to use the Xcalar Design visual studio to reimplement the Matlab logic as a series of built-in filters, aggregates, sorts, joins and projections. Where the standard operators would not suffice, we wrote user-defined functions (UDFs) in Python and incorporated them as scalable Map operators.

Xcalar implemented a fuzzy join between the Market Cap and Earnings datasets to account for their different time granularities. In other words, a single Earnings Publication_Date must match the set of daily Market Cap Business_Date rows that fall within the visible time range. This is the notion of a nearest match, rather than an exact match.

Xcalar was able to extend basic equi-join technology by enabling UDFs written in Python to specify the rules for joining Market Cap and Earnings rows. The UDFs were relatively simple, but the flexibility to integrate them into a Xcalar dataflow (algorithm) without compromising performance and scalability was a key requirement.
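The nearest-match rule can be illustrated with plain Python for a single security. This is only a sketch of the idea, not the UDFs used in the project and not the Xcalar UDF interface. The function name asof_join and the tuple layouts are hypothetical, and treating earnings published on the analysis date itself as already visible is an assumption.

```python
from bisect import bisect_right
from datetime import date

def asof_join(market_caps, publications):
    """Match each daily market cap row to the latest earnings published
    on or before that business date (for one security).

    market_caps:  list of (business_date, market_cap_usd)
    publications: list of (publication_date, earnings_usd)
    Returns a list of (business_date, market_cap_usd, earnings_usd).
    """
    pubs = sorted(publications)
    pub_dates = [d for d, _ in pubs]
    joined = []
    for business_date, cap in market_caps:
        # Index of the first publication strictly after the business date.
        i = bisect_right(pub_dates, business_date)
        if i == 0:
            continue  # no earnings had been published yet on this date
        joined.append((business_date, cap, pubs[i - 1][1]))
    return joined

# Invented sample: original results in January, restated a month later.
publications = [(date(2017, 1, 25), 100.0), (date(2017, 2, 24), 110.0)]
market_caps = [
    (date(2017, 1, 24), 5000.0),
    (date(2017, 1, 25), 5010.0),
    (date(2017, 2, 23), 5100.0),
    (date(2017, 2, 24), 5120.0),
    (date(2017, 3, 1), 5200.0),
]
joined = asof_join(market_caps, publications)
```

Each Market Cap row picks up the earnings figure that the market could actually see on that day, so the restated value only takes effect from its own publication date onward.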

Here are some salient features of the Xcalar solution that were validated by the customer:
  • Data cleansing was required to identify and remove a small amount (less than 0.01%) of erroneous data that had been introduced into the public datasets. Xcalar’s profiling and correlation capabilities made these data errors evident.
  • Filters were introduced early in the algorithm to scrub data that would not figure in the final result. For example, Market Cap data from 1965 to 1992 was removed because the earliest Earnings data was from January 1992.
  • Correctness was confirmed at each step of the dataflow by comparing row counts to complementary tables and running group-by operations. Xcalar’s ability to undo and revert allowed for this verification to be done without impacting the main algorithm.
  • Every step of the algorithm was reviewed with market analytics engineers using Xcalar’s dataflow graph visualization. This involved examining intermediate results, data lineage and cardinality to ensure that there were no data anomalies.
  • The E/P algorithm was implemented iteratively from the Matlab pseudo-code. Xcalar’s undo, revert, and redo capabilities were vital to being able to try out different methods, check result sets for validity and then either backtrack or move forward.
  • The E/P modeling dataflow had been constructed and refined on a subset of the datasets to ensure correctness and completeness. With a few clicks, that model was saved and executed as a batch dataflow against the full dataset, optimized to run in a smaller memory footprint but with maximum throughput.
  • Batch dataflows were parameterized with a few clicks to run the model against different time periods: 1992-present, 2000-present, 2016 only.
  • E/P jobs in production could be run on-demand, scheduled using Xcalar Design, or triggered by a system scheduler (for example, cron or Autosys).
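Two of the practices listed above, filtering early by a parameterized time period and checking row counts against the complementary table, can be sketched as follows. This is an illustrative Python fragment, not Xcalar's parameterization mechanism. The function name filter_by_period, the dictionary row layout and the sample rows are assumptions made for the example.

```python
from datetime import date

def filter_by_period(rows, start, end):
    """Split rows into (kept, dropped) by Business_Date within [start, end]."""
    kept, dropped = [], []
    for row in rows:
        if start <= row["Business_Date"] <= end:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped

# Invented Market Cap rows; only the dates matter for this check.
rows = [
    {"Security_ID": "A", "Business_Date": date(1980, 6, 2), "Value": 10.0},
    {"Security_ID": "A", "Business_Date": date(1992, 1, 2), "Value": 12.0},
    {"Security_ID": "A", "Business_Date": date(2016, 7, 1), "Value": 15.0},
]

# The period is a parameter, so the same logic serves 1992-present,
# 2000-present, or a single year.
kept, dropped = filter_by_period(rows, date(1992, 1, 1), date(2017, 5, 31))

# Row-count check: the filtered table and its complement must add up
# to the input, otherwise rows were lost or duplicated.
assert len(kept) + len(dropped) == len(rows)
```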
The following table is a summary of the results for the Earnings-Price ratio computation for various analysis date ranges. Xcalar Data Platform was installed on commodity Amazon Web Services (AWS) EC2 compute clusters.
  • Xcalar demonstrated strong scale-out for the E/P algorithm on the large result sets. The same batch dataflow was run without modification or data placement on all 3 clusters.
  • Elapsed times measured with a commercial relational database (RDBMS) and with Spark were 3x-5x higher at the 2-node level.
  • Scale-out to larger node counts was not an option for the commercial RDBMS, where clustering was a high availability solution and not for performance.
  • Scaling Spark to 4 nodes for performance would have required careful data sharding, which we did not invest the time and effort to attempt.
The time taken to develop, test, and iterate towards the final algorithm was prohibitively long with other technologies. With Xcalar Design, a preliminary working algorithm was completed in a few hours on a subset of data with visual programming. The focus then moved quickly to verifying correctness using intermediate results in Xcalar’s dataflow graph. Operationalizing the algorithm on the full set of Earnings and Market Cap data took only a few keystrokes.

Demonstrated Key Benefits of Xcalar Technology

  1. Data Access without Movement
    • Xcalar can be deployed near the data source, on-prem or in the cloud. For this particular project, data source files were located on Amazon S3 to be in close proximity to the Amazon EC2 compute cluster running the Xcalar Data Platform.
    • For the production deployment, Xcalar will be installed near the production server with direct access to the enterprise data lake.
  2. Performance and Scalability
    • Xcalar demonstrated strong scale-out for the E/P algorithm on large result sets, running the same batch dataflow without modification or data placement on every cluster tested.
    • Elapsed times measured with a commercial RDBMS and with Spark were 3x-5x higher at the 2-node level.
  3. Transparency
    • Clear auditability and data lineage showing the sequence of transformations and intermediate results for datasets in the petabyte range.
  4. Operationalize Models on Full Datasets for Production Scale
    • A modeling algorithm can be immediately saved as a batch dataflow that can run on the production cluster against full datasets.
    • The batch dataflow is optimized for end-to-end execution efficiency and consumes a fraction of the memory resources utilized in modeling.
    • Models developed on desktop tools like Matlab need to be recoded in Scala to run on Spark clusters. As we discovered, without carefully distributing datasets between cluster nodes, Spark performance degrades as nodes are added.
We are excited that the promise of visual programming with Xcalar Design and scale-out relational compute with the Xcalar Compute Engine has been proven with a concrete use case in the computational market analytics space.

The Xcalar journey into fundamental problems for equity research has just begun. Problems we are looking at for customers in the equity research segment include Conditional Sparse Matrices and Expectation Maximization using K-Means clusters.