Xcalar Data Platform is a scale-out platform for data processing applications and operationalizing ML. The platform is open and extensible, and suitable for developing and operationalizing complex business logic. Xcalar's use cases include virtual data warehousing to enable BI tools to query real-time transactionally consistent data, operationalizing ML algorithms at cloud-scale, as well as simplifying data transformation and quality processes. This enterprise-grade software scales linearly to hundreds of nodes, thousands of users, and petabytes of data for public/private cloud and hybrid deployments.
With Xcalar Data Platform you can significantly accelerate your business logic develop > test > operationalize cycles, working directly with your custom data lake. It provides real-time processing at cloud-scale without modifying or moving your source data.
Xcalar Data Platform Features
An enterprise-grade data processing platform that scales linearly to 100s of nodes and 1000s of users to handle petabytes of data in seconds
True Data in Place™
Xcalar Data Platform works directly with source data files without ingesting data into an internal format, regardless of where the data resides. Users analyze data in its original form through metadata views that are dynamic and fluid. Data does not need to be prepped, moved, or cleansed before analysis. Xcalar Data Platform was built on the premise of keeping data in its original form while still allowing relational compute. Thousands of users can use Xcalar to analyze billions of rows simultaneously, collaborate, and share data pipelines with added ML routines, all with enterprise-grade reliability and linear scalability.
Xcalar Data Platform supports ANSI SQL for running queries on data. Users can combine SQL with modern programming languages and can invoke ML algorithms from open source libraries such as TensorFlow, Spark ML, and H2O. Xcalar Data Platform allows users to write functions in Python, share them, reuse them, and call them from within their SQL queries. Queries can then be operationalized and scheduled to run on larger volumes of data.
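The pattern of registering a Python function so it can be called by name from SQL can be sketched with the standard library's sqlite3 module. This is an illustration of the general technique only; Xcalar's own UDF registration API differs, and the function and table names here are invented for the example.

```python
import sqlite3

# Hypothetical UDF: uppercase and strip whitespace from a ticker symbol.
def normalize_ticker(sym):
    return sym.strip().upper()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, qty INTEGER)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [(" aapl ", 100), ("msft", 50)])

# Register the Python function under the name SQL will use to call it.
conn.create_function("normalize_ticker", 1, normalize_ticker)

rows = conn.execute(
    "SELECT normalize_ticker(symbol), qty FROM trades ORDER BY 1"
).fetchall()
print(rows)  # [('AAPL', 100), ('MSFT', 50)]
```

Once the function is registered, it behaves like any built-in SQL function, which is the same ergonomic goal the platform describes: business logic authored once in Python, invoked declaratively from queries.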
Xcalar Data Platform enables structured programming, using Python for developing models and processing data. Python code can be blended seamlessly with visual programming and SQL to create dataflows.
Xcalar Data Platform’s sophisticated visual studio and IDE, Xcalar Design, provides an intuitive graphical user interface for users to work interactively with very large diverse datasets as virtual tables within Xcalar. Users derive meaningful insights from their data using a combination of (1) SQL, (2) structured programming languages such as Python and (3) visual programming. Users can use these three paradigms interchangeably, as and when needed, to create an end-to-end dataflow graph. User operations automatically formulate dataflows that are run in parallel across Xcalar Data Platform’s distributed cluster. This powerful interface enables anyone to develop and scale sophisticated models on Xcalar Data Platform’s massively parallel computing architecture.
Operationalizing machine learning algorithms at cloud-scale
Using Xcalar Data Platform, data scientists can train and deploy supervised learning ML models to score, detect anomalies, and cluster objects across petabytes of data. Machine learning can be applied at any stage of the data pipeline and billions of classifications done in seconds. Users have the ability to do modular programming and can, from within SQL code, invoke algorithms from open source machine learning libraries like TensorFlow, Spark ML, H2O, Scikit-learn and Keras.
Separation of storage from compute
Xcalar Data Platform minimizes idle infrastructure by providing a clean separation of compute from storage, allowing resources for each to be scaled independently for efficiency. This isolation of storage from compute addresses a problem that plagues many computing platforms: infrastructure sized for peak demand on one dimension sits idle on the other. Xcalar Data Platform enables organizations to right-size resources for storage and compute while meeting storage and processing needs for sustained and burst workloads, resulting in significant savings in the total cost of ownership.
Xcalar Data Platform can provide linear scalability for hundreds of nodes, solving real-world relational problems that have complex causal relationships within the data. Each Xcalar cluster node has a compute engine that maintains a distributed shared memory space spanning the cluster. The memory architecture is designed for sharing rows, columns, tables, and matrices. Any Xcalar cluster node can access data anywhere in the cluster, thereby simplifying application programming and making scale-out automatic. Xcalar Data Platform applies MPP computing, while maintaining strong data consistency.
Dataflows are algorithms developed in Xcalar Data Platform by users through visual programming, SQL, and structured programming. As users perform actions, execute queries, and run code, dataflows are generated and stored as JSON files describing the metadata and operations. Users can drill down to view schemas, distribution of data across nodes and data skew. These dataflows can be reused by multiple users to enable collaboration and to increase productivity. Dataflows can be saved, operationalized, and scheduled to run on production data with a few clicks.
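Since a dataflow is serialized as a JSON description of operations and their inputs, its lineage can be inspected programmatically. The JSON shape below is an assumption made for illustration; Xcalar's actual dataflow schema differs, and the field names (`nodes`, `op`, `parents`) are invented.

```python
import json

# Hypothetical serialized dataflow: a list of operator nodes, each naming its parents.
dataflow_json = """
{
  "name": "daily_sales_pipeline",
  "nodes": [
    {"id": 1, "op": "load",    "args": {"source": "s3://bucket/sales.csv"}},
    {"id": 2, "op": "filter",  "args": {"predicate": "region = 'EMEA'"}, "parents": [1]},
    {"id": 3, "op": "groupBy", "args": {"keys": ["product"]}, "parents": [2]}
  ]
}
"""

dataflow = json.loads(dataflow_json)

def lineage(df):
    """Walk the node list in order, reporting each operation and its input nodes."""
    return [(n["op"], n.get("parents", [])) for n in df["nodes"]]

print(lineage(dataflow))  # [('load', []), ('filter', [1]), ('groupBy', [2])]
```

A plain-JSON representation like this is what makes dataflows shareable, diffable, and auditable artifacts rather than opaque binaries.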
Xcalar Data Platform can read and analyze arbitrary data, including data not conforming to the expected format, encoding, type, and attributes. Visibility into exceptions is often important during discovery. While processing virtual tables, Xcalar saves non-conformant rows in a separate virtual table that denotes Integrity Constraint Violations (ICVs) for the operation being performed. Users can triage these at any stage of the analytic pipeline. Xcalar enables users to bind models to the data at any stage of processing and to perform a supervised or unsupervised delayed resolution of the ICVs.
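The ICV idea, routing rows that fail a constraint into a side table instead of aborting the whole operation, can be sketched in a few lines. The row shapes and the cast rule below are illustrative assumptions, not Xcalar's internal representation.

```python
# Raw input rows; two of them violate the expected types.
raw_rows = [
    {"user_id": "42",  "amount": "19.99"},
    {"user_id": "n/a", "amount": "5.00"},   # non-conformant user_id
    {"user_id": "7",   "amount": "oops"},   # non-conformant amount
]

def cast_row(row):
    """The 'expected schema' for this sketch: integer id, float amount."""
    return {"user_id": int(row["user_id"]), "amount": float(row["amount"])}

clean, icv = [], []
for row in raw_rows:
    try:
        clean.append(cast_row(row))
    except ValueError:
        # Preserved verbatim for later supervised or unsupervised resolution.
        icv.append(row)

print(len(clean), len(icv))  # 1 2
```

The key property is that the operation completes over the conformant rows while nothing is silently dropped: every violation remains queryable in the ICV table.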
Operational workload management
Batch jobs created from dataflow models run with optimized performance, whether periodically or on demand. This enables users to deploy their analytics pipeline at cloud-scale in a secure production environment. During operationalization, Xcalar Data Platform automatically applies performance optimization changes where applicable, including automatically dropping metadata and moving filter clauses up to execute earlier in the algorithm to minimize data handling downstream. Batch dataflows can be saved, shared across clusters, loaded, parameterized, and scheduled to run. Xcalar Data Platform operates in a latency-sensitive mode when interactive performance (a few seconds to tens of seconds for billions of rows) is desired. It also operates in a high-throughput mode when petabyte scale workloads require efficient processing. Xcalar Data Platform provides the flexibility to partition a cluster’s processing power to concurrently handle different kinds of workloads. Please see “Mixed mode operations”, below.
Full memory hierarchy
Xcalar Data Platform uses the entire memory hierarchy, comprising DRAM, Storage Class Memory (SCM), and, when available, flash and disk. Users can build clusters within their budget by mixing high-performance DRAM with lower-cost memory options for an optimal price-performance ratio.
Optimized I/O performance
Xcalar Data Platform has a highly parallel read and write architecture that enables thousands of cores to perform simultaneous read and write operations on disk, network, or streaming data sources. Unlike many systems that are optimized for either reads or writes, Xcalar Data Platform delivers optimal performance for both operations.
Reliability and fault tolerance
Xcalar Data Platform runs user code, including user-defined functions (UDFs), parsers, applications, algorithms, and connectors, in containers called Xcalar Processing Units (XPUs). Within an XPU, each custom code module runs in its own address space and jeopardizes neither overall cluster performance nor stability. In the case of node failure, Xcalar Data Platform quickly recovers using session redo logs.
Xcalar Design enables users to detect skew, the common imbalance of data placement across nodes that can cause uneven utilization of memory, CPU, and network resources. Xcalar Data Platform detects and diagnoses data skew across all cluster nodes. Xcalar Design’s comprehensive dashboard both details the occurrence of data skew and provides tools for updating the user’s algorithms to avoid skew.
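A simple way to quantify the skew described above is to compare each node's load against a perfectly even distribution. The metric and example numbers below are assumptions for illustration; they are not the formula Xcalar's dashboard uses.

```python
def skew_ratio(rows_per_node):
    """Max node load divided by the mean load; 1.0 means perfectly balanced."""
    mean = sum(rows_per_node) / len(rows_per_node)
    return max(rows_per_node) / mean

# Two hypothetical 4-node clusters holding 1,000 rows each.
balanced = [250, 250, 250, 250]
skewed   = [700, 100, 100, 100]

print(skew_ratio(balanced))  # 1.0
print(skew_ratio(skewed))    # 2.8
```

A ratio well above 1.0 means one node is doing a disproportionate share of the work, which is exactly the condition (hot keys, poor partitioning) that a skew dashboard would surface for the user to fix in their algorithm.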
Backup and recovery
Xcalar Data Platform creates redo logs that allow recovery from both partial failures (such as aborted transactions) and cluster failures (such as power outages). Logs and datasets can be backed up by users on a schedule. Checkpoints are supported, allowing recovery from hot snapshots of data.
Ad hoc analytics/modeling on hundreds of billions of rows
Xcalar Data Platform enables users to perform interactive analysis on hundreds of billions of rows including applying relational operations, such as join, union, group by, pivot, filter, aggregate, sort and merge.
Xcalar Data Platform works with source data files of any format. It preserves user data in its original format without needing to convert it into an Xcalar-specific format. Xcalar Data Platform natively supports common file formats, such as CSV, JSON, Excel, XML, and Parquet. Xcalar Data Platform was designed from the ground up to perform relational operations on arbitrary data. It can work with semi-structured data (such as log files, JSON documents, or IoT data), structured data (such as rows and columns), unstructured data (such as images and social media content), as well as streaming data sources such as Kafka. New file connectors can be written in Python with a simple SDK.
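The general shape of a file connector, a small Python class that turns a raw byte stream into records, can be sketched with the standard library. The class name and interface here are invented for illustration; Xcalar's connector SDK defines its own base class and registration mechanism.

```python
import csv
import io

class CsvConnector:
    """Hypothetical connector: turns a raw byte stream into dict records, one per row."""

    def __init__(self, stream):
        self.stream = stream

    def records(self):
        # Decode the byte stream, then let csv.DictReader map the header row
        # onto each data row.
        text = io.TextIOWrapper(self.stream, encoding="utf-8")
        for row in csv.DictReader(text):
            yield row

# Simulate a source file with an in-memory byte stream.
data = io.BytesIO(b"symbol,qty\nAAPL,100\nMSFT,50\n")
rows = list(CsvConnector(data).records())
print(rows)  # [{'symbol': 'AAPL', 'qty': '100'}, {'symbol': 'MSFT', 'qty': '50'}]
```

Because the connector only needs to yield records, the same pattern extends naturally to JSON lines, XML, or streaming sources: swap the parsing loop, keep the interface.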
Lineage and auditability
Dataflows track data from source through each transformation. Dataflows include low-level details, including the path to the data source, intermediate tables created, and the operators applied to these tables. Xcalar Data Platform’s rich set of features, supported by the dataflows, allow users to drill down to schemas, distribution of data across nodes, and data skew. Dataflows are auditable, and can be both archived as graphics and exported as JSON files.
Enable BI tools to query live data (JDBC and REST)
Xcalar provides JDBC support, which enables integration with BI applications such as Tableau, Qlik, and Power BI. Xcalar virtual tables, data marts or cubes can be queried by any BI tool for visualization and reporting. Users can offload Tableau and Qlik servers when querying multidimensional cubes or data marts. Alternatively, the tables can be exported to databases, data marts or OLAP cubes, providing an end-to-end analytics platform.
Xcalar Data Platform can handle real-time inserts, modifications, and deletions coming in at microsecond intervals while maintaining transactional consistency. Xcalar Data Platform microbatch dataflows can efficiently process such complex sequences of insert, modify, and delete operations on large volumes of data from multiple data sources. Users can either manually schedule microbatch dataflows or initiate them on demand through a RESTful API. Xcalar Data Platform enables users to view all the transformations performed on their data, including a timeline of inserts and updates. Users can traverse back visually to any point on this timeline.
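The microbatch idea, applying an ordered sequence of insert/modify/delete operations as a single all-or-nothing unit, can be sketched as follows. The operation format and the staging-then-commit approach are assumptions made for this example, not Xcalar's implementation.

```python
def apply_microbatch(table, ops):
    """Apply all ops to a staged copy; the caller commits only if every op succeeds."""
    staged = dict(table)
    for op in ops:
        kind, key = op["op"], op["key"]
        if kind == "insert":
            staged[key] = op["value"]
        elif kind == "modify":
            staged[key] = {**staged[key], **op["value"]}
        elif kind == "delete":
            del staged[key]
        else:
            raise ValueError(f"unknown op {kind!r}")
    return staged  # commit: this replaces the visible table atomically

table = {1: {"qty": 10}}
batch = [
    {"op": "insert", "key": 2, "value": {"qty": 5}},
    {"op": "modify", "key": 1, "value": {"qty": 12}},
    {"op": "delete", "key": 2},
]
table = apply_microbatch(table, batch)
print(table)  # {1: {'qty': 12}}
```

Working on a staged copy is what gives the batch transactional consistency in this sketch: readers never observe a half-applied sequence, because the visible table is swapped only after every operation in the batch has succeeded.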
Active data is prioritized and auto-tiered higher in the memory hierarchy when Xcalar Data Platform processes workloads. As users activate or deactivate sessions, shared memory is used or immediately freed up. Also, virtual tables not immediately needed can be exported to disk and later imported, when needed. Xcalar Data Platform automatically applies memory optimization where applicable during operationalization, including automatically dropping metadata and moving filter clauses up to execute earlier in the algorithm to minimize data handling downstream. Once operationalized, batch jobs are optimized for end-to-end execution efficiency, and consume a small fraction of the memory utilized in modeling.
Mixed mode operations
Xcalar gives users the ability to partition the processing power across the entire cluster to accommodate different kinds of workloads that may be running concurrently. Xcalar can run multiple batch jobs within SLA, while also running latency-sensitive workloads that enable business users to do ad hoc analytics interactively.
Xcalar Data Platform can be deployed on-cloud, such as on Microsoft Azure, Amazon Web Services, or Google Cloud Platform, or to on-premises servers and private clouds. Xcalar Data Platform also supports hybrid deployment.
Security and control
Xcalar Data Platform provides user- and role-based access control, including integration with Kerberos, Azure AD, LDAP, OAuth, and other custom authentication and user management services. Xcalar allows admins to control what can be shared within Xcalar. Through the dataflow graphs, algorithms and models can be audited and traced back to identify the source data and the owner of the source dataset.
Xcalar Data Platform is designed for collaboration. Teams of users can download and share their Xcalar workbooks for efficient and easy collaboration, configure datasets to be shared with other users, or read UDFs from a central repository.
Administration and system management
Xcalar Design includes performance tuning dashboards for system administrators. The dashboards advise on how to improve resource utilization, how to meet service-level objectives when administrators operationalize their workloads, and how to manage data skew within a cluster.