Data Applications for Large Scale Data Transformations Made Easy - Part 2



In this second blog of the Data Applications for Large Scale Data Transformations Made Easy series, I want to tell you how Xcalar improved developer productivity, reduced time to market, and optimized production performance SLAs while reducing TCO (total cost of ownership) for one of our customers, a top-5 global investment bank.

Our Xcalar team has just finished the production deployment of a near real-time streaming data applications platform at one of the top financial institutions. The wealth management division of this bank provides its customers with consolidated financial information, portfolio analysis, and mark-to-market valuations across all their investments. This data includes balances, historical account statements, client holdings, tax lots, gains and losses, and other details. All of this information changes frequently due to new trading transactions and corporate actions, and the changes may reach back several years to capture the impact of delayed legal case resolutions.

Technology Challenges with Spark

The previous Apache Spark implementation had several challenges. The bank was limited to daily batches because the Spark deployment was very slow (over four and a half hours for a single batch), so near real-time data computations were out of the question. The skill sets required to deploy and maintain an Apache Spark-based implementation were hard to find and expensive: engineers were expected to understand Spark internals, have Java expertise, and be savvy in the business domain - a very hard-to-find combination. Furthermore, after 25,000 lines of monolithic Java/Spark code had been developed, it became clear that it was not only expensive to get started, but even more expensive to retain the talent needed to maintain such complex code.

As a result, the bank struggled with an unacceptably long time to value and with long development and troubleshooting cycles. In fact, their last revision took 2 people over 3 months to refactor - 6 person-months just to refactor! You can imagine the effort that a new project of similar complexity would require. Finally, there was no developer IDE for distributed data applications - just Java editors that forced you to deploy to production in order to gain limited debuggability via data dumps (interspersed throughout the production code) while running in batch. All of these factors contributed to a very high TCO for Apache Spark at the bank.


Image 1. Technology Challenges with Spark

Xcalar addressed all these issues, and more. The bank’s engineers and managers fell in love with the Xcalar IDE (see Image 2 below). With Xcalar, they replaced 25,000 lines of opaque Java/Spark code with just over a dozen visually intuitive Xcalar modules. Engineers needed to know only SQL and a little bit of Python to become dangerous and create powerful distributed data applications. Xcalar also accelerated time-to-value for new capabilities: new business requirements no longer wait 3 to 6 months to be implemented. Development and testing now take just days or, at most, weeks.

Xcalar is low code. No need to hand-code distributed compute - all tasks related to scaling, indexing, distribution, and orchestration are taken care of by the Xcalar engine.

Xcalar is able to extend virtual memory by leveraging SSD storage and can operate on datasets significantly larger than available physical memory with near-in-memory query latencies. As a result, batches that took 4.5 hours with Apache Spark now run in only 15 minutes using Xcalar on the same infrastructure - more than a 10x price/performance boost. Because of the higher performance, the bank could now implement near real-time updates for its customers, unlike the Spark deployment, which made them wait until the next day to see updates to their portfolios.
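To make the arithmetic behind these figures explicit, here is a small back-of-the-envelope calculation in Python. It uses only the two batch durations quoted above. The assumptions are that running on the same infrastructure means the same cost per hour, so the speedup can stand in for price/performance, and that batches could run back to back around the clock. The variable names are illustrative only.

```python
# Figures quoted in the text.
spark_batch_minutes = 4.5 * 60   # 4.5 hours per batch with Spark
xcalar_batch_minutes = 15        # 15 minutes per batch with Xcalar

# Assumption: same infrastructure, hence the same cost per hour for both runs,
# so the raw speedup approximates the price/performance improvement.
speedup = spark_batch_minutes / xcalar_batch_minutes

# Assumption: batches run back to back, around the clock, with no idle time.
# This is a theoretical upper bound, not a measured refresh rate.
minutes_per_day = 24 * 60
max_batches_per_day = minutes_per_day // xcalar_batch_minutes

print(f"speedup: {speedup:.0f}x")
print(f"upper bound on batches per day: {max_batches_per_day}")
```

The ratio works out to 18x, which is consistent with the "more than 10x" claim, and a 15-minute batch leaves room for many refreshes per day instead of one.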

The original Apache Spark-based solution was further enhanced using Xcalar’s support for streaming data. The data for dozens of data applications is fed both from the HDFS data lake and from several dozen multi-partitioned Kafka streams. Unlike the Spark jobs, which basically ran in batch, Xcalar data applications work with streaming data by applying frequent micro-batch updates to live tables.
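The micro-batch idea can be illustrated with a minimal, framework-free Python sketch. This is not Xcalar's API and not the bank's code. The event shape (account, symbol, quantity change), the function names, and the in-memory dictionary standing in for a live table are all assumptions made for illustration. A real deployment would read from Kafka partitions rather than a Python list.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape: (account_id, symbol, quantity_delta).
Event = Tuple[str, str, float]
Holdings = Dict[Tuple[str, str], float]


def micro_batches(events: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Group an event stream into small batches of at most batch_size events."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


def apply_batch(table: Holdings, batch: List[Event]) -> None:
    """Apply one micro-batch to the live table in place."""
    for account_id, symbol, delta in batch:
        key = (account_id, symbol)
        table[key] = table.get(key, 0.0) + delta


def run(events: Iterable[Event], batch_size: int = 2) -> Tuple[Holdings, List[Holdings]]:
    """Consume the stream; return the final table and a snapshot after each batch."""
    table: Holdings = {}
    snapshots: List[Holdings] = []
    for batch in micro_batches(events, batch_size):
        apply_batch(table, batch)
        snapshots.append(dict(table))  # what a reader would see after this update
    return table, snapshots


if __name__ == "__main__":
    trades = [("A1", "IBM", 100.0), ("A1", "IBM", -40.0), ("A2", "MSFT", 10.0)]
    final_table, history = run(trades)
    print(final_table)
    print(len(history), "micro-batches applied")
```

The point of the sketch is the shape of the loop: readers always see a consistent table as of the last applied batch, and the batch size (or interval) controls how fresh that view is.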

Implementation with Xcalar

Image 2. Data application implementation with Xcalar

Our customer estimates that Xcalar enabled 5x higher developer productivity and an overall 20x improvement in time-to-value, not to mention the 10x TCO reduction and price/performance boost. The customer attributes the developer productivity improvement mainly to Xcalar’s low code programming paradigm, zero-change production deployment, and shortened operationalization and troubleshooting phases.

Image 3. Xcalar Benefits


When I was leading 50-person teams at Morgan Stanley and Goldman Sachs, the instrumentation overload required to write and maintain data applications was really bogging down the efficiency of my team. That was a real problem for me. With Xcalar that problem is gone - engineers are now empowered to work on business logic without looking back. I no longer have to worry about infrastructure issues and skyrocketing costs due to unoptimized configurations or wrong indexes. Instead, I can focus on the business logic around my fraud detection, anti-money laundering, mark-to-market, and other real business requirements.

Xcalar is for the data engineers and data analysts among you who are on the line to deliver results to business stakeholders. It is for those of you who know exactly what needs to be delivered - an insight, a data mart for visualization or machine learning, complex business logic computations and transformations, decision support systems, etc. And you want to get to those business results immediately using the full scope of data (terabytes and beyond), not samples. At this point you would probably like to avoid the time-consuming complexities of Spark's RDDs, dataset partitioning, executors, and tuples, as well as the similar nuances of other low-level SDKs and APIs, where you will likely get stuck fighting infrastructure and performance optimization issues at the expense of programming business logic.

There’s of course a certain satisfaction that comes from solving difficult, complex issues with frameworks like Spark. I get it. If I were an engineer just out of school, I would want to put Spark on my resume - I would love to go through these challenges and learn. However, this is not the future of data engineering work. Spark is a fine tool for data science, but for compute-intensive data transformations it brings needless complexity. If you want to free yourself from low-level minutiae and focus on business logic to solve real-world business problems, then use the enterprise-grade system that is guaranteed to make you more productive and also outperform Spark. If you are interested, please request a free trial of Xcalar Cloud Developer.

In the next and final part of this blog series, we will walk through the data engineering experience of building a small data application and an in-memory data mart. Using a simple log-processing example, I will show how easily one can build a complex data application with Xcalar.