Data Lineage And Auditability For Retail Operations


The retail operations division of a major financial services firm needed to improve the efficiency of their data preparation processes. Most of the analyses, real-time management decisions, and reporting functions of their enterprise depend on fast and reliable assimilation of data from hundreds of sources and systems. The rapid growth of variety and volume in their data has created further operational efficiency challenges.

Many compelling challenges existed, including the following:
  • Inability of business users and data scientists to achieve self-sufficiency and serve themselves
    • Many tools require programming skills beyond those of most end users.
    • Business users and data scientists must request IT support through a ticket system.
      • Long delays often result from loose coupling of teams, capacity issues, and geographic separation.
      • Requests are often misunderstood or implemented incorrectly.
    • Analysts and users cannot be proactive or nimble enough to change course quickly.
  • Too many tools, too many layers, too many players
    • A wide array of tools results in a lack of consistency, uniformity, and universality.
    • Overlapping tools compete with each other.
    • The spectrum of technical skills is too broad. Skills are too specialized and fragmented.
    • Lineage and traceability of data transformations become lost in the machinery.
      • Each step of the process is handled independently, lacking transparency and detailed documentation.
      • Each tool functions slightly differently and does not share information in a standard fashion.
  • Processing capacities and availability of prepared data constrain the business
    • Current data volumes are already reaching their limits, with processing often consuming days or weeks.
    • Future data growth is inevitable, which will further intensify this issue.
    • Several current technologies appear to be at their limits, unable to scale naturally.
  • Varieties of data formats are increasing at a rapid rate
    • New data types and data sources are emerging, and they are very hard to integrate.
    • Existing tools have limitations with semi-structured/unstructured data.
    • Conversion efforts are highly manual and not fully integrated or operationalized.
    • Extensibility to yet-to-be-known formats is not predictable.

The Xcalar Solution

Xcalar undertook a pilot project to demonstrate its ability to overcome the data preparation problems described above with a scalable, extensible, and integrated solution. Use cases were incorporated which represented a broad spectrum of user constituencies, skill levels, and data formats. Users from all skill levels participated in the testing, training, and rollout.

Four scenarios were tested:

(1) Large volume data exploration, data cleansing and integration of multiple data formats and data sources
(2) Operationalized extraction, refinement, and distribution of raw data from an HDFS data lake
(3) Streaming intra-day analysis of equity transactions for margin analysis, compliance audit, and fraud detection
(4) Data volume scale-up to prove future headroom and process to expand Xcalar resources

Xcalar was installed on two dedicated bare-metal clusters on premises:

(1) one cluster for modeling and data analysis
(2) one cluster for the recurring execution of operationalized dataflows.

Use Case #1: Large-volume data wrangling and blending

Diverse multi-terabyte datasets of XML, JSON, and delimited text data were analyzed and blended along with Hive/Parquet data. Some datasets were imported directly into Xcalar using out-of-the-box connectors, while others leveraged Xcalar extensibility via Python import user-defined functions. These simple, custom functions allowed the import of complex data for all users to explore and analyze. Non-technical business users in multiple geographic locations quickly learned how to visually explore data, construct dataflows, and generate batch dataflows using the Xcalar Design visual programming interface.
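
As an illustration of the kind of Python import user-defined function described above, the following minimal sketch turns an XML stream into flat records. The function name parse_trades_xml, its (path, stream) signature, the <trade> element layout, and the sample values are all assumptions made for this example; they are not Xcalar's documented interface or the firm's actual data.

```python
import io
import xml.etree.ElementTree as ET


def parse_trades_xml(path, stream):
    """Yield one flat dict per <trade> element found in the stream.

    Hypothetical import function: the (path, stream) signature and the
    <trade> layout are assumptions for illustration only.
    """
    # iterparse reads incrementally, so large files need not fit in memory.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "trade":
            continue
        record = {"source_file": path}
        record.update(elem.attrib)          # XML attributes become fields
        for child in elem:                  # child elements become fields
            record[child.tag] = (child.text or "").strip()
        yield record
        elem.clear()                        # release the parsed subtree


# Made-up sample input, used only to exercise the function.
sample = io.BytesIO(
    b"<trades>"
    b"<trade id='1'><cusip>CUSIP0001</cusip><qty>100</qty></trade>"
    b"<trade id='2'><cusip>CUSIP0002</cusip><qty>250</qty></trade>"
    b"</trades>"
)
records = list(parse_trades_xml("sample.xml", sample))
```

Writing the function as a generator that yields plain dictionaries keeps the parsing logic separate from whatever engine consumes the records, which is one plausible reason a small custom function can open a complex format to every user.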

Example: data preparation workflow for the daily CUSIP and ISIN audit:
  • Data sources involved:
    • list of CUSIPs to be audited
    • list of ISINs to be audited
    • current daily trade data from 7 subsidiaries, multiple source systems and data formats
    • historical trade data for rolling 36 months from an HDFS data lake
  • Algorithm creation time comparison:
    • Existing methods: 1 to 4 days
    • Xcalar: 3 hours for initial dataflow, 15 seconds to parameterize for subsequent executions
  • Algorithm execution time comparison:
    • Existing methods: 3.75 to 6 hours
    • Xcalar: 4 to 11 minutes
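
The core selection step of such an audit can be sketched in a few lines of Python. This is a simplified illustration, not the dataflow built in the pilot: the record fields (cusip, isin, trade_date), the helper names, and the sample trades are hypothetical, and the rolling window is assumed to start on the first day of the month 36 months before the audit date.

```python
from datetime import date


def months_back(day, months):
    """Return the first day of the month `months` months before `day`."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def select_audit_trades(trades, cusips, isins, as_of, window_months=36):
    """Keep trades on either watchlist that fall inside the rolling window."""
    start = months_back(as_of, window_months)
    return [
        t for t in trades
        if start <= t["trade_date"] <= as_of
        and (t.get("cusip") in cusips or t.get("isin") in isins)
    ]


# Hypothetical watchlists and trades, for illustration only.
audit_cusips = {"C1"}
audit_isins = {"I1"}
trades = [
    {"id": "t1", "cusip": "C1", "isin": None, "trade_date": date(2024, 3, 1)},
    {"id": "t2", "cusip": "C9", "isin": "I1", "trade_date": date(2022, 6, 30)},
    {"id": "t3", "cusip": "C1", "isin": None, "trade_date": date(2020, 1, 15)},
    {"id": "t4", "cusip": "C9", "isin": "I9", "trade_date": date(2024, 2, 1)},
]
selected = select_audit_trades(trades, audit_cusips, audit_isins, date(2024, 3, 15))
selected_ids = [t["id"] for t in selected]
```

Because the watchlists and the audit date are ordinary arguments, re-running the same logic for a new day only requires new parameter values, which mirrors the parameterization step noted in the comparison above.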

Use Case #2: Operationalized extraction and refinement of raw data from an HDFS data lake

Raw data had accumulated in various HDFS data lakes for more than two years. However, HDFS had been a one-way data resource: data populated the system, but very little was retrieved or used as output. This PoC scenario took HDFS data archives and increased their accessibility, utilization, and productivity. Vital reports and analyses in Qlik, Tableau, and Excel were rewritten to leverage data from Xcalar, which imported data directly from HDFS. Users constructed batch dataflows using Xcalar native HDFS connectors. The dataflows performed data refining and preparation, then exported the results to the various reporting tools. The architectural change achieved the following outcomes:
  • Accelerated report delivery schedules and timelines
  • Eliminated resource demands on production transactional systems for reporting
  • Added comprehensive tracing of lineage from source to output
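
To make the idea of tracing lineage from source to output concrete, the sketch below records, for each derived dataset, the operation and inputs that produced it, and then walks that record backwards. It is a generic illustration under stated assumptions: the LineageGraph class, the dataset names, and the operation names are hypothetical and do not describe how Xcalar stores lineage internally.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Records which operation and inputs produced each derived dataset."""
    steps: dict = field(default_factory=dict)

    def record(self, output, operation, inputs):
        self.steps[output] = {"operation": operation, "inputs": list(inputs)}

    def trace(self, dataset):
        """Return the upstream steps for `dataset`, nearest step first."""
        chain, pending, seen = [], [dataset], set()
        while pending:
            name = pending.pop(0)
            if name in seen or name not in self.steps:
                continue  # raw sources have no recorded step
            seen.add(name)
            step = self.steps[name]
            chain.append((name, step["operation"], tuple(step["inputs"])))
            pending.extend(step["inputs"])
        return chain

    def sources(self, dataset):
        """Return the raw inputs (names with no recorded step) behind `dataset`."""
        found, pending, seen = set(), [dataset], set()
        while pending:
            name = pending.pop()
            if name in seen:
                continue
            seen.add(name)
            if name in self.steps:
                pending.extend(self.steps[name]["inputs"])
            else:
                found.add(name)
        return found


# Hypothetical three-step flow from raw HDFS data to a report extract.
graph = LineageGraph()
graph.record("trades_clean", "filter_nulls", ["hdfs_trades_raw"])
graph.record("trades_enriched", "join_on_account", ["trades_clean", "hdfs_accounts_raw"])
graph.record("report_extract", "export_csv", ["trades_enriched"])

chain = graph.trace("report_extract")
raw_inputs = graph.sources("report_extract")
```

With every step recorded in one place, an auditor can start from a report extract and list each transformation and raw input behind it, rather than reconstructing the path across separate tools.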

Use Case #3: Intra-day streaming transaction analysis

The trade operations team had searched for a data delivery platform for intra-day equity data. Position analysis and portfolio inventory are vital yet constantly changing indicators. Xcalar batch dataflows were constructed to import micro-batches of transaction data at five-second intervals and construct time-series data for visualization in Tableau. Additional hourly retrospective data packages were also made available for consumption by compliance and risk-management groups.
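
The micro-batch approach can be illustrated with a short Python sketch that groups transactions into five-second buckets and sums quantities per symbol, producing rows suitable for a time-series chart. The field names (ts, symbol, qty), the function names, and the sample transactions are assumptions for this example; the aggregation actually used in the pilot is not described in the source.

```python
from collections import defaultdict

INTERVAL_SECONDS = 5  # micro-batch interval described above


def bucket_start(epoch_seconds, interval=INTERVAL_SECONDS):
    """Floor a timestamp to the start of its interval."""
    return int(epoch_seconds // interval) * interval


def to_time_series(transactions, interval=INTERVAL_SECONDS):
    """Sum signed quantities per (bucket, symbol), ordered by time then symbol."""
    totals = defaultdict(int)
    for tx in transactions:
        key = (bucket_start(tx["ts"], interval), tx["symbol"])
        totals[key] += tx["qty"]
    return [
        {"bucket": bucket, "symbol": symbol, "net_qty": qty}
        for (bucket, symbol), qty in sorted(totals.items())
    ]


# Hypothetical transactions: timestamps in epoch seconds, signed quantities.
transactions = [
    {"ts": 100.2, "symbol": "ABC", "qty": 50},
    {"ts": 103.9, "symbol": "ABC", "qty": -20},
    {"ts": 104.0, "symbol": "XYZ", "qty": 10},
    {"ts": 105.0, "symbol": "ABC", "qty": 5},
]
series = to_time_series(transactions)
```

Flooring each timestamp to its bucket start makes every micro-batch independent of arrival order within the interval, so late rows in the same five-second window still land in the correct point of the series.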

Use Case #4: Data Volume Scale-Up

Two forms of testing were performed to demonstrate Xcalar’s ability to scale:
  • Resource scale-up testing: Xcalar performed data preparation for three years’ worth of trade data volume on the operational cluster. CPU and memory resources were scaled on clusters with node counts of 4, 8, and 16 on premises, followed by a cloud-based test of 16-, 32-, and 64-node clusters (using masked data). Scale testing demonstrated that Xcalar was able to achieve within 10% of linear speed-up by the addition of resources up to the cluster size of 64 nodes.
  • Data scale-up testing: Executed on cloud infrastructure (16, 32, and 64 nodes), the base dataset of 3 years was synthetically projected to 6 and 9 years to test the effect of data scale-up. Xcalar demonstrated 94% linearity of elapsed time when comparing 3 years to 9 years of data volume on 64 nodes.
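
The source does not state how the speed-up and linearity figures were calculated. One common way to express them, shown here as an assumption, is to divide the observed speed-up by the ideal speed-up for added nodes, and to divide the ideal elapsed time by the observed elapsed time for added data. The function names are hypothetical, and the timings below are invented to show the arithmetic; they are not measurements from the pilot.

```python
def speedup_efficiency(base_nodes, base_seconds, nodes, seconds):
    """Observed speed-up divided by the ideal (linear) speed-up."""
    observed = base_seconds / seconds
    ideal = nodes / base_nodes
    return observed / ideal


def data_scale_linearity(base_volume, base_seconds, volume, seconds):
    """Ideal elapsed time under linear scaling divided by the observed time."""
    ideal_seconds = base_seconds * (volume / base_volume)
    return ideal_seconds / seconds


# Hypothetical timings, chosen only to illustrate the calculation.
# 16 nodes take 1200 s; 64 nodes take 320 s (ideal would be 300 s).
resource_eff = speedup_efficiency(16, 1200, 64, 320)

# 3 years of data take 300 s; 9 years take 1000 s (ideal would be 900 s).
data_lin = data_scale_linearity(3, 300, 9, 1000)
```

Under these definitions a value of 1.0 means perfectly linear scaling, so a result "within 10% of linear" corresponds to an efficiency of at least 0.9.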

Key Benefits of Xcalar Technology

During the pilot deployment project, Xcalar was able to demonstrate the following:
  • Faster results than existing data preparation tools/solutions
    • Quicker algorithm development and data exploration
    • Faster execution against production data
  • A comprehensive, single integrated solution for data preparation
  • Greater usability by non-technical users, particularly by exploiting the visual features of Xcalar Design
    • Self-sufficiency of many of the non-technical business users
  • Ability to extend the Xcalar framework to arbitrary data types via user-defined functions
  • Scalability into the future for data volumes beyond today’s levels
  • HDFS connectivity and coexistence with the Hadoop ecosystem
  • Storage cost savings driven by Xcalar’s True Data in Place™ architecture and its elimination of redundant data storage