Xcalar Data Platform

An enterprise-grade data processing platform that scales linearly to 100s of nodes and 1000s of users to handle petabytes of data in seconds

Xcalar Data Platform is a scale-out platform for data processing applications and operationalizing ML. The platform is open and extensible, and suitable for developing and operationalizing complex business logic. Xcalar's use cases include virtual data warehousing to enable BI tools to query real-time transactionally consistent data, operationalizing ML algorithms at cloud-scale, and simplifying data transformation and quality processes. This enterprise-grade software scales linearly to hundreds of nodes, thousands of users, and petabytes of data for public/private cloud and hybrid deployments.

With Xcalar Data Platform you can significantly accelerate your business logic develop > test > operationalize cycles, working directly with your custom data lake. It provides real-time processing at cloud-scale without modifying or moving your source data.

Xcalar Data Platform Features


True Data in Place™

Xcalar Data Platform works directly with source data files, regardless of where the data resides, without ingesting data into an internal format. Users analyze data in its original form through metadata views that are dynamic and fluid. Data does not need to be prepped, moved or cleansed before analysis. Xcalar Data Platform was built on the premise of keeping data in its original form while still allowing relational compute. Thousands of users can use Xcalar simultaneously to analyze billions of rows, collaborate, and share data pipelines with added ML routines, all with enterprise-grade reliability and linear scalability.

SQL

Xcalar Data Platform supports ANSI SQL for running queries on data. Users can combine SQL with modern programming languages and can invoke ML algorithms from open-source libraries such as TensorFlow, Spark ML and H2O. Xcalar Data Platform allows users to write functions in Python, share them, reuse them, and call them from within their SQL queries. Queries can then be operationalized and scheduled to run on larger volumes of data.
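
To make the pattern concrete, here is a minimal sketch of the idea: a plain Python function shaped like a shareable UDF, together with the kind of SQL that could invoke it by name. Xcalar's actual UDF registration API is not shown; the function, table, and column names are invented for illustration.

```python
# Hypothetical UDF: a plain Python function that could be shared and
# called from SQL. The registration mechanism is not shown here.
def clean_phone(raw):
    """Normalize a phone number string to digits only."""
    return "".join(ch for ch in raw if ch.isdigit())

# A query of roughly this shape could then call the function inline
# (illustrative SQL; table and column names are invented):
EXAMPLE_QUERY = """
SELECT customer_id, clean_phone(phone) AS phone
FROM customers
WHERE region = 'EMEA'
"""

print(clean_phone("(415) 555-0100"))  # -> 4155550100
```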

Structured programming

Xcalar Data Platform enables structured programming, using Python for developing models and processing data. Python code can be blended seamlessly with visual programming and SQL to create dataflows.

Visual programming

Xcalar Data Platform’s sophisticated visual design studio and IDE, Xcalar Design, provides an intuitive graphical user interface for users to work interactively with very large, diverse datasets as virtual tables within Xcalar. Users derive meaningful insights from their data using a combination of (1) SQL, (2) structured programming languages such as Python and (3) visual programming. Users can use these three paradigms interchangeably, as and when needed, to create an end-to-end dataflow graph. User operations automatically formulate dataflows that are run in parallel across Xcalar Data Platform’s distributed cluster. This powerful interface enables anyone to develop and scale sophisticated models on Xcalar Data Platform’s massively parallel computing architecture.

Operationalizing machine learning algorithms at cloud-scale

Using Xcalar Data Platform, data scientists can train and deploy supervised learning ML models to score, detect anomalies, and cluster objects across petabytes of data. Machine learning can be applied at any stage of the data pipeline, with billions of classifications done in seconds. Users can do modular programming and can, from within SQL code, invoke algorithms from open-source machine learning libraries such as TensorFlow, Spark ML, H2O, Scikit-learn and Keras.

Separation of storage from compute

Xcalar Data Platform minimizes idle infrastructure by providing a clean separation of compute from storage, allowing for resources for each to be scaled independently for efficiency. This full isolation of storage and compute has enabled Xcalar to address a key problem that plagues many computing platforms. Xcalar Data Platform enables organizations to right-size resources for storage and compute, while meeting storage and processing needs for sustained and burst workloads, resulting in significant savings in the total cost of ownership.

Linear scalability

Xcalar Data Platform can provide linear scalability for hundreds of nodes, solving real-world relational problems that have complex causal relationships within the data. Each Xcalar cluster node has a compute engine that maintains a distributed shared memory space spanning the cluster. The memory architecture is designed for sharing rows, columns, tables, and matrices. Any Xcalar cluster node can access data anywhere in the cluster, thereby simplifying application programming and making scale-out automatic. Xcalar Data Platform applies MPP computing, while maintaining strong data consistency.

Additional Features

Storage-agnostic

Xcalar Data Platform is storage-agnostic: it works directly with source data files regardless of where the data resides, without ingesting data into an internal format. Data does not need to be prepped, moved or cleansed before analysis, and users analyze it in its original form through dynamic metadata views while still applying relational compute.

Reusable dataflows

Dataflows are algorithms developed in Xcalar Data Platform by users through visual programming, SQL, and structured programming. As users perform actions, execute queries, and run code, dataflows are generated and stored as JSON files describing the metadata and operations. Users can drill down to view schemas, distribution of data across nodes and data skew. These dataflows can be reused by multiple users to enable collaboration and to increase productivity. Dataflows can be saved, operationalized, and scheduled to run on production data with a few clicks.
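
As a rough illustration of what "stored as JSON files describing the metadata and operations" could look like, the sketch below builds a small dataflow description, serializes it, and reloads it. The field names and structure are invented for this example and are not Xcalar's actual schema.

```python
import json

# Hypothetical dataflow description: field names are illustrative only,
# not Xcalar's real on-disk schema.
dataflow = {
    "name": "daily_sales_rollup",
    "source": {"path": "/datasets/sales/2024/*.csv", "format": "csv"},
    "operations": [
        {"op": "filter", "predicate": "amount > 0"},
        {"op": "groupBy", "keys": ["region"], "agg": "sum(amount)"},
    ],
}

# Serializing and reloading shows how a dataflow could be shared with
# other users and reused without rebuilding it from scratch.
shared = json.dumps(dataflow, indent=2)
reloaded = json.loads(shared)
print(reloaded["operations"][1]["op"])  # -> groupBy
```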

Data Quality

Xcalar Data Platform can read and analyze arbitrary data, including data not conforming to the expected format, encoding, type, and attributes. Visibility to exceptions is often important to have during discovery. While processing virtual tables, Xcalar saves non-conformant rows in a separate virtual table that denotes Integrity Constraint Violations (ICVs) for the operation being performed. Users can triage these at any stage of the analytic pipeline. Xcalar enables users to bind models to the data at any stage of processing and to perform a supervised or unsupervised delayed resolution of the ICVs.
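
A minimal sketch of the ICV idea: while processing rows, non-conformant records are diverted to a separate violations table for later triage instead of aborting the job. The conformance check and field names below are invented.

```python
# Toy version of saving non-conformant rows to a separate ICV table.
def split_icvs(rows, conforms):
    """Partition rows into (clean, violations) using a conformance check."""
    clean, violations = [], []
    for row in rows:
        (clean if conforms(row) else violations).append(row)
    return clean, violations

rows = [{"qty": "3"}, {"qty": "n/a"}, {"qty": "7"}]
clean, icvs = split_icvs(rows, lambda r: r["qty"].isdigit())
print(len(clean), len(icvs))  # -> 2 1
```

The violations list plays the role of the separate virtual table: the main pipeline keeps running on clean rows, and the ICVs can be resolved at any later stage.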

Operational workload management

Batch jobs created from dataflow models run with optimized performance, whether periodically or on demand. This enables users to deploy their analytics pipeline at cloud-scale in a secure production environment. During operationalization, Xcalar Data Platform automatically applies performance optimization changes where applicable, including automatically dropping metadata and moving filter clauses up to execute earlier in the algorithm to minimize data handling downstream. Batch dataflows can be saved, shared across clusters, loaded, parameterized, and scheduled to run. Xcalar Data Platform operates in a latency-sensitive mode when interactive performance (a few seconds to tens of seconds for billions of rows) is desired. It also operates in a high-throughput mode when petabyte scale workloads require efficient processing. Xcalar Data Platform provides the flexibility to partition a cluster’s processing power to concurrently handle different kinds of workloads. Please see “Mixed mode operations”, below.
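
The "moving filter clauses up" optimization can be sketched with a toy example: filtering before a join touches far fewer rows than filtering afterwards, while producing the same answer. The dataset and predicate are invented.

```python
# Toy filter pushdown: filter-then-join vs. join-then-filter.
orders = [{"id": i, "region": "EMEA" if i % 10 == 0 else "APAC"}
          for i in range(1000)]
customers = {i: f"cust{i}" for i in range(1000)}

# Naive plan: join every order, then filter the joined result.
joined = [(o["id"], customers[o["id"]]) for o in orders]
late = [row for row, o in zip(joined, orders) if o["region"] == "EMEA"]

# Pushed-down plan: filter first, join only the survivors.
early = [(o["id"], customers[o["id"]])
         for o in orders if o["region"] == "EMEA"]

assert late == early   # identical result
print(len(early))      # -> 100 rows joined instead of 1000
```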

Full memory hierarchy

Xcalar Data Platform uses the entire memory hierarchy, comprising DRAM, Storage Class Memory (SCM), and, when available, flash and disk. Users can build clusters within their budget by mixing high-performance DRAM with lower-cost memory options for an optimal price-performance ratio.

Optimized I/O performance

Xcalar Data Platform has a highly parallel read and write architecture that enables thousands of cores to perform simultaneous read and write operations on disk, network, or streaming data sources. Unlike many systems that are optimized for either reads or writes, Xcalar Data Platform delivers optimal performance for both operations.

Reliability and fault tolerance

Xcalar Data Platform runs user code, including user-defined functions (UDFs), parsers, applications, algorithms, and connectors, in containers called Xcalar Processing Units (XPUs). Within an XPU, each custom code module runs in its own address space and jeopardizes neither overall cluster performance nor stability. In the case of node failure, Xcalar Data Platform recovers quickly using session redo logs.

Extensible architecture

Xcalar is language-agnostic and enables users to write UDFs for custom data transformations and for converting data from custom formats. Xcalar Design provides a simple SDK for users to create new extensions that can be installed and shared in Xcalar Design. For example, programmers can extend Xcalar Design by writing JavaScript extensions to create new UI panels and can map the extended functionality to user-defined functions written in Python. Extensions can also invoke UDFs and the Xcalar Processing Unit functionality.

Skew management

Xcalar Design enables users to detect skew, the commonplace imbalances of data placement across nodes, which can cause an uneven utilization of memory, CPU, and network resources. Xcalar Data Platform detects and diagnoses data skew across all cluster nodes. Xcalar Design’s comprehensive dashboard both details the occurrence of data skew and provides tools for updating the user’s algorithms to avoid skew.
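
A back-of-the-envelope version of the kind of skew check such a dashboard might surface: compare each node's row count to the cluster mean, so an overloaded node stands out. The threshold of "fair share" and the numbers are invented.

```python
# Toy skew report: ratio of each node's row count to the cluster mean.
# A ratio well above 1.0 flags a node holding more than its fair share.
def skew_report(rows_per_node):
    mean = sum(rows_per_node) / len(rows_per_node)
    return [round(n / mean, 2) for n in rows_per_node]

# Node 0 holds 3x its fair share of the data:
ratios = skew_report([900, 100, 100, 100])
print(ratios)  # -> [3.0, 0.33, 0.33, 0.33]
```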

Backup and recovery

Xcalar Data Platform creates redo logs that allow recovery from both partial failures (such as aborted transactions) and cluster failures (such as a power outage). Logs and datasets can be backed up by users on a schedule. Checkpoints are supported, allowing recovery from hot snapshots of data.
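
Conceptually, redo-log recovery means replaying logged operations on top of the last checkpoint to rebuild state after a failure. The sketch below illustrates that idea with an invented log format; Xcalar's actual log records are not shown.

```python
# Toy redo-log replay: rebuild state from a checkpoint plus logged ops.
def replay(checkpoint, redo_log):
    state = dict(checkpoint)
    for op, key, value in redo_log:
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

checkpoint = {"a": 1, "b": 2}
redo_log = [("put", "c", 3), ("delete", "a", None), ("put", "b", 20)]
print(replay(checkpoint, redo_log))  # -> {'b': 20, 'c': 3}
```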

Ad hoc analytics/modeling on hundreds of billions of rows

Xcalar Data Platform enables users to perform interactive analysis on hundreds of billions of rows, applying relational operations such as join, union, group by, pivot, filter, aggregate, sort and merge.

File format-agnostic

Xcalar Data Platform works with source data files of any format. It preserves user data in its original format, without needing to convert it into a format specific to Xcalar. Xcalar Data Platform natively supports common file formats, such as CSV, JSON, Excel, XML and Parquet, and was designed from the ground up to perform relational operations on arbitrary data. It can work with semi-structured data, such as log files, JSON documents, or IoT data; structured data, such as rows and columns; unstructured data, such as images and social media content; and streaming data sources, such as Kafka. New file connectors can be written in Python with a simple SDK.
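
To give a feel for what a custom Python connector could look like, here is a sketch of a parser that turns a bespoke line format into records. Xcalar's actual connector SDK interface is not shown; the log format and field names are invented.

```python
# Hypothetical connector body: a generator that parses
# 'timestamp|sensor|value' lines into records, skipping blanks.
def parse_sensor_log(lines):
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ts, sensor, value = line.split("|")
        yield {"ts": ts, "sensor": sensor, "value": float(value)}

sample = ["2024-01-01T00:00|temp|21.5", "", "2024-01-01T00:05|temp|21.7"]
records = list(parse_sensor_log(sample))
print(len(records))  # -> 2
```

Writing the parser as a generator keeps it streaming-friendly: records are produced one at a time rather than materializing the whole file in memory.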

Lineage and auditability

Dataflows track data from source through each transformation. Dataflows include low-level details, including the path to the data source, intermediate tables created, and the operators applied to these tables. Xcalar Data Platform’s rich set of features, supported by the dataflows, allow users to drill down to schemas, distribution of data across nodes, and data skew. Dataflows are auditable, and can be both archived as graphics and exported as JSON files.

Enable BI tools to query live data (JDBC and REST)

Xcalar provides JDBC support, which enables integration with BI applications such as Tableau, Qlik, and Power BI. Xcalar virtual tables, data marts or cubes can be queried by any BI tool for visualization and reporting. Users can offload Tableau and Qlik servers when querying multidimensional cubes or data marts. Alternatively, the tables can be exported to databases, data marts or OLAP cubes, providing an end-to-end analytics platform.

Microbatch transactions

Xcalar Data Platform can handle real-time inserts, modifications and deletions coming in at microsecond intervals while maintaining transactional consistency. Xcalar Data Platform microbatch dataflows can efficiently process such complex sequences of insert, modify and delete operations on large volumes of data from multiple data sources. Users can either manually schedule microbatch dataflows or initiate them on demand through a RESTful API. Xcalar Data Platform enables users to view all the transformations performed on their data, including a timeline of inserts and updates. Users can traverse back visually to any point on this timeline.
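
A toy model of the microbatch idea: a batch of insert/modify/delete operations applied all-or-nothing to a keyed table, so readers never see a half-applied batch. The operation names and table layout are invented for illustration.

```python
# Toy microbatch apply: stage all changes on a copy, then swap in the
# result, giving all-or-nothing semantics for the whole batch.
def apply_microbatch(table, batch):
    staged = dict(table)
    for op, key, row in batch:
        if op in ("insert", "modify"):
            staged[key] = row
        elif op == "delete":
            del staged[key]  # raises if key missing -> batch rejected
    return staged

table = {1: {"qty": 5}}
batch = [("insert", 2, {"qty": 1}),
         ("modify", 1, {"qty": 9}),
         ("delete", 2, None)]
print(apply_microbatch(table, batch))  # -> {1: {'qty': 9}}
```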

Memory optimization

Active data is prioritized and auto-tiered higher in the memory hierarchy when Xcalar Data Platform processes workloads. As users activate or deactivate sessions, shared memory is used or immediately freed up. Also, virtual tables not immediately needed can be exported to disk and later imported, when needed. Xcalar Data Platform automatically applies memory optimization where applicable during operationalization, including automatically dropping metadata and moving filter clauses up to execute earlier in the algorithm to minimize data handling downstream. Once operationalized, batch jobs are optimized for end-to-end execution efficiency, and consume a small fraction of the memory utilized in modeling.

Mixed mode operations

Xcalar gives users the ability to partition the processing power across the entire cluster to accommodate different kinds of workloads that may be running concurrently. Xcalar can run multiple batch jobs within SLA, while also running latency-sensitive workloads that enable business users to do ad hoc analytics interactively.

Flexible deployment

Xcalar Data Platform can be deployed in the cloud, such as on Microsoft Azure, Amazon Web Services, or Google Cloud Platform, or on on-premises servers and private clouds. Xcalar Data Platform also supports hybrid deployments.

Security and control

Xcalar Data Platform provides user- and role-based access control, including integration with Kerberos, Azure AD, LDAP, OAuth, and other custom authentication and user management services. Xcalar allows admins to control what can be shared within Xcalar. Through the dataflow graphs, algorithms and models can be audited and traced back to identify the source data and the owner of the source dataset.

Collaboration

Xcalar Data Platform is designed for collaboration. Teams of users can download and share their Xcalar workbooks for efficient and easy collaboration, configure datasets to be shared with other users, or read UDFs from a central repository.

Administration and system management

Xcalar Design includes performance tuning dashboards for system administrators. The dashboards advise on how to improve resource utilization, how to meet service-level objectives when administrators operationalize their workloads, and how to manage data skew within a cluster.

Use Cases

Developing and operationalizing complex business logic and ML algorithms at cloud-scale

Interactively work with petabytes of data; billions of classifications in seconds

Virtual data warehousing and ad hoc analytics

Migrate your traditional data warehouses to an open scale-out architecture

Data transformation and quality at petabyte scale

Work with trillions of rows; process petabyte-scale batch data in seconds or minutes

Live data for BI and reporting

Unlock your data lake for your analysts and data scientists