January 21, 2018

Rethink Your Data Pipeline

Enterprises that seek a competitive edge through analytics typically draw data from many sources, such as internal systems, partners, and third-party providers. The variety of data formats, semantics, timeliness, and quality can make analytics difficult: the data must be adapted, fixed, and blended before it can be used.

The Big Trade-off with Analytics

Many enterprises that have aggressively adopted analytics treat this initial data preparation step as an isolated task performed by data engineering specialists. Companies adopt this view for different reasons; the main business drivers we've seen are improving throughput for this discrete step and leveraging existing organizational resources. This can help scale overall pipeline throughput, but typically at the cost of making iterative, ad hoc analytics difficult. Dependencies across organizational groups are manageable when the work is smooth and consistent, but prohibitive when doing something new that requires many adjustments and repeated attempts.

Businesses often want to answer questions that require new data sources, or existing data prepared in a new way. To support this, many enterprises are now rethinking their analytics pipeline around better collaboration and tool sharing, so that business analysts can drive the pipeline themselves.

Xcalar Data Platform Democratizes Advanced Analytics

One of Xcalar's customers, a global leader in data storage, analyzes telemetry data from hundreds of thousands of servers deployed around the world. The telemetry data is stored in a Hadoop data lake, and subsets are extracted and prepared for analysis by data scientists, product managers, and operations analysts. This telemetry data is critical for understanding how their products perform in the field and for planning new products, features, and services that better meet their customers' needs. The data comes from many different hardware models and is stored in a variety of file formats.

To make this data available for analysis, an internal IT group had been compiling and preparing data according to specifications defined by their end users -- mainly senior management, product teams, and customer support staff. Many challenges were overcome to get this process started, including building a large library of Hive scripts that construct datasets from the data lake. But the process was slow, resource-intensive, and poorly suited to supporting new workloads on demand from business staff.
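To make the workflow concrete, the sketch below illustrates the kind of dataset-building step such a script library automates, expressed here in Python using PySpark's Hive support rather than raw HiveQL. The table, column, and path names (telemetry_raw, model_id, /datalake/prepared/...) are hypothetical placeholders, not the customer's actual schema.

```python
# A minimal sketch of an extraction-and-preparation job of the kind described
# above, assuming a Spark cluster configured with a Hive metastore.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("telemetry-extract")
    .enableHiveSupport()      # read tables registered in the Hive metastore
    .getOrCreate()
)

# Pull one hardware model's telemetry for a reporting window and normalize
# a few fields before handing the subset to downstream analysts.
# (Hypothetical table and column names.)
subset = spark.sql("""
    SELECT device_id,
           model_id,
           LOWER(firmware_version)           AS firmware_version,
           CAST(reported_at AS TIMESTAMP)    AS reported_at,
           temperature_c
    FROM   telemetry_raw
    WHERE  model_id = 'X500'
      AND  reported_at >= '2017-12-01'
""")

# Write the prepared subset back to the lake in a columnar format.
subset.write.mode("overwrite").parquet("/datalake/prepared/telemetry_x500")
```

Each new end-user request typically meant another script like this one, which is part of why the process scaled poorly for ad hoc work.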

After transitioning from Hive to Xcalar Data Platform, their data engineering team was able to extract and prepare datasets at least 10 times faster on 93% fewer servers, with additional benefits including data lineage and better algorithm reuse. The quality of the data for analysis also improved, because the modelers on this team could now visually inspect, profile, and fix large datasets that grow to hundreds of millions of rows.

The most significant improvement in their analytics capabilities occurred when their business analysts, who were also using Xcalar Data Platform for downstream analytics, discovered they could access the same data sources and algorithmic models set up by their IT team. By working on a single, highly scalable platform with a simple GUI, business analysts could directly drive ad hoc analytics workloads from beginning to end.

Preparing Datasets 10 Times Faster Using Fewer Servers

At Xcalar, we view data preparation, data science, and analytics as solutions built on our scale-out relational compute platform. For data preparation, our visual programming tool, Xcalar Design, includes built-in functions for statistical profiling and for detecting misspellings and formatting errors, plus a business rules engine to isolate data exceptions, and more. For data scientists, Xcalar Design embeds the Jupyter Notebook IDE, so popular open data science libraries such as NumPy, pandas, and TensorFlow can be applied at any point in the pipeline. Xcalar Design also enables non-programmers, such as business analysts, to interactively develop algorithmic models and apply machine learning operations to petabytes of operational data.
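As a rough illustration of the profiling and rule-checking steps described above, here is a short pandas sketch of the kind of code a modeler might run in the embedded Jupyter notebook. The file name, column names, and rule thresholds are hypothetical, chosen only to show the pattern.

```python
# A minimal profiling sketch, assuming a small telemetry sample exported as CSV.
# File, column names, and thresholds below are hypothetical.
import pandas as pd

df = pd.read_csv("telemetry_sample.csv")

# Statistical profile: per-column counts, means, quantiles, and null rates.
print(df.describe(include="all"))
print(df.isna().mean().sort_values(ascending=False))

# Flag likely formatting errors, e.g. firmware versions that do not match
# the expected "major.minor.patch" pattern.
bad_fw = df[~df["firmware_version"].astype(str).str.match(r"^\d+\.\d+\.\d+$")]
print(f"{len(bad_fw)} rows with malformed firmware_version")

# Isolate simple business-rule exceptions, e.g. physically implausible
# temperature readings, for review before the data reaches analysts.
exceptions = df[(df["temperature_c"] < -40) | (df["temperature_c"] > 125)]
print(f"{len(exceptions)} rows flagged by the temperature rule")
```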

We're excited to see customers learning how best to leverage the Xcalar platform to accomplish their analytics goals. When seeking actionable insights from your data, the speed of iterative experimentation is correlated with the quality of the outcome. For ad hoc workloads, business analysts can now drive the analytics pipeline by sharing tools that were previously the domain of data engineering and data science.

To start your free trial of Xcalar Data Platform, click here.
Joseph Yen