Xcalar Cloud Enterprise

A data processing platform that scales linearly to 100s of nodes and 1000s of users to handle petabytes of data in seconds

Xcalar Cloud Enterprise is a scale-out platform for data processing applications and operationalizing ML. The product is open and extensible, and suitable for developing and operationalizing complex business logic. Xcalar's use cases include virtual data warehousing to enable BI tools to query real-time transactionally consistent data, operationalizing ML algorithms at cloud-scale, as well as simplifying data transformation and quality processes. This enterprise-grade software scales linearly to hundreds of nodes, thousands of users, and petabytes of data for public/private cloud and hybrid deployments.

With Xcalar Cloud Enterprise you can significantly accelerate your business logic develop > test > operationalize cycles, working directly with your custom data lake. It provides real-time processing at cloud-scale without modifying or moving your source data.

Xcalar Cloud Enterprise Features

True Data in Place™

  • Works directly with source data files without inducting data into an internal format and regardless of where the data resides.
  • Analyze data in its original form through metadata views that are dynamic and fluid.
  • Data does not need to be prepped, moved or cleansed before analysis.
  • Thousands of users can use Xcalar to analyze billions of rows simultaneously, collaborate and share data pipelines with added ML routines.

Visual programming

  • Sophisticated visual studio and IDE provide an intuitive graphical user interface.
  • Combines SQL, structured programming languages such as Python, and visual programming.
  • User operations automatically formulate dataflows that are run in parallel across Xcalar’s distributed cluster.
  • Powerful interface enables anyone to develop and scale sophisticated models on Xcalar’s massively parallel computing architecture.

SQL

  • Supports ANSI SQL for running queries on data.
  • Write SQL using modern languages and invoke ML algorithms from open source libraries like TensorFlow, Spark ML and H2O.
  • Write functions in Python, share them, reuse them and call them from within your SQL queries.
  • Queries can then be operationalized and scheduled to run on larger volumes of data.
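Calling a shared Python function from inside a SQL query can be illustrated with Python's built-in sqlite3 module. This is a generic sketch of the idea, not Xcalar's API; the table, columns, and the normalize_region function are hypothetical.

```python
import sqlite3

def normalize_region(code):
    """Hypothetical UDF: trim and upper-case a region code; pass NULL through."""
    return code.strip().upper() if code is not None else None

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it by name (takes 1 argument).
conn.create_function("normalize_region", 1, normalize_region)

conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(" us-west", 10.0), ("US-WEST ", 5.0), ("eu-central", 7.5)],
)

# The UDF is invoked inside the query like any built-in SQL function.
totals = conn.execute(
    "SELECT normalize_region(region), SUM(amount) "
    "FROM orders GROUP BY normalize_region(region) ORDER BY 1"
).fetchall()
```

Here the two differently formatted "us-west" rows fall into the same group because the Python function runs before aggregation.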

Linear scalability

  • Provides linear scalability for hundreds of nodes, solving real-world relational problems that have complex causal relationships within the data.
  • Each Xcalar cluster node has a compute engine that maintains a distributed shared memory space spanning the cluster.
  • The memory architecture is designed for sharing rows, columns, tables, and matrices.
  • Any cluster node can access data anywhere in the cluster, thereby simplifying application programming and making scale-out automatic.
  • Applies MPP computing, while maintaining strong data consistency.

Separation of storage from compute

  • Minimizes idle infrastructure by providing a clean separation of compute from storage.
  • Resources for each can be scaled independently for efficiency.
  • Right-size resources for storage and compute, while meeting storage and processing needs for sustained and burst workloads.
  • Achieve significant savings in the total cost of ownership.

Operationalizing machine learning algorithms at cloud-scale

  • Data scientists can train and deploy supervised learning ML models to score, detect anomalies, and cluster objects across petabytes of data.
  • Machine learning can be applied at any stage of the data pipeline and billions of classifications done in seconds.
  • Invoke algorithms within SQL code, from open source machine learning libraries like TensorFlow, Spark ML, H2O, Scikit-learn and Keras.

Additional Features

Storage-agnostic

Xcalar Cloud Enterprise works directly with source data files without inducting data into an internal format and regardless of where the data resides. Users analyze data in its original form through metadata views that are dynamic and fluid. Data does not need to be prepped, moved or cleansed before analysis. Xcalar was built on the premise of keeping data in its original form while still allowing relational computation. Thousands of users can use Xcalar to analyze billions of rows simultaneously, collaborate and share data pipelines with added ML routines, all with enterprise-grade reliability and linear scalability.

Reusable dataflows

Dataflows are algorithms developed in Xcalar Cloud Enterprise by users through visual programming, SQL, and structured programming. As users perform actions, execute queries, and run code, dataflows are generated and stored as JSON files describing the metadata and operations. Users can drill down to view schemas, distribution of data across nodes and data skew. These dataflows can be reused by multiple users to enable collaboration and to increase productivity. Dataflows can be saved, operationalized, and scheduled to run on production data with a few clicks.
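As a rough illustration of a dataflow stored as JSON and replayed later, the sketch below records operations as plain data and runs them in order. The schema (name, source, operations) and the two operation types are hypothetical and do not reproduce Xcalar's actual dataflow format.

```python
import json

# Hypothetical dataflow description: source metadata plus an ordered list of operations.
dataflow = {
    "name": "high_value_orders",
    "source": {"path": "orders.csv", "format": "csv"},
    "operations": [
        {"op": "filter", "column": "amount", "min": 100},
        {"op": "project", "columns": ["id", "amount"]},
    ],
}

def run_dataflow(flow, rows):
    """Replay the recorded operations, in order, over a list of dict rows."""
    for step in flow["operations"]:
        if step["op"] == "filter":
            rows = [r for r in rows if r[step["column"]] >= step["min"]]
        elif step["op"] == "project":
            rows = [{c: r[c] for c in step["columns"]} for r in rows]
        else:
            raise ValueError(f"unknown operation: {step['op']}")
    return rows

# Round-trip through JSON to show the description can be saved and shared as a file.
restored = json.loads(json.dumps(dataflow))
sample = [
    {"id": 1, "amount": 250, "region": "us"},
    {"id": 2, "amount": 40, "region": "eu"},
]
result = run_dataflow(restored, sample)
```

Because the dataflow is data rather than code, a second user could load the same JSON and run it against a different set of rows.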

Data Quality

Xcalar Cloud Enterprise can read and analyze arbitrary data, including data that does not conform to the expected format, encoding, type, or attributes. Visibility into exceptions is often important during discovery. While processing virtual tables, Xcalar saves non-conformant rows in a separate virtual table that denotes Integrity Constraint Violations (ICVs) for the operation being performed. Users can triage these at any stage of the analytic pipeline. Xcalar enables users to bind models to the data at any stage of processing and to perform a supervised or unsupervised delayed resolution of the ICVs.
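The idea of routing non-conformant rows into a separate table can be sketched in a few lines of Python. This is a minimal illustration under assumed names (cast_with_icv and the row layout are hypothetical), not how Xcalar implements ICV tables.

```python
def cast_with_icv(rows, column, caster):
    """Apply `caster` to `column`; route rows that fail into a separate ICV list."""
    ok, icv = [], []
    for row in rows:
        try:
            ok.append({**row, column: caster(row[column])})
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the original row plus the failure reason so it can be triaged later.
            icv.append({"row": row, "error": type(exc).__name__})
    return ok, icv

rows = [{"id": 1, "age": "42"}, {"id": 2, "age": "n/a"}, {"id": 3}]
clean, violations = cast_with_icv(rows, "age", int)
```

The operation completes for the conformant rows, while the rejected rows stay available, unmodified, for later resolution.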

Operational workload management

Batch jobs created from dataflow models run with optimized performance, whether periodically or on demand. This enables users to deploy their analytics pipeline at cloud-scale in a secure production environment. During operationalization, Xcalar automatically applies performance optimizations where applicable, including dropping unneeded metadata and moving filter clauses up to execute earlier in the algorithm to minimize data handling downstream. Batch dataflows can be saved, shared across clusters, loaded, parameterized, and scheduled to run. Xcalar Cloud Enterprise operates in a latency-sensitive mode when interactive performance (a few seconds to tens of seconds for billions of rows) is desired. It also operates in a high-throughput mode when petabyte-scale workloads require efficient processing. Xcalar provides the flexibility to partition a cluster’s processing power to concurrently handle different kinds of workloads. See “Mixed mode operations” below.
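Moving filter clauses earlier in a pipeline can be sketched as a simple plan rewrite. The example below assumes a hypothetical plan format with only row-wise "map" steps and "filter" steps, and a deliberately simplified rule: a filter may move ahead of a map step unless that step produces the column the filter reads. It is not Xcalar's optimizer.

```python
def push_filters_up(plan):
    """Move each filter ahead of preceding map steps that do not produce its column."""
    plan = list(plan)  # work on a copy so the caller's plan is unchanged
    for i in range(1, len(plan)):
        j = i
        while (
            j > 0
            and plan[j]["op"] == "filter"
            and plan[j - 1]["op"] == "map"
            and plan[j - 1]["output"] != plan[j]["column"]
        ):
            # Swapping is safe here: the map is row-wise and the filter
            # does not depend on the column the map creates.
            plan[j - 1], plan[j] = plan[j], plan[j - 1]
            j -= 1
    return plan

plan = [
    {"op": "map", "output": "amount_usd"},
    {"op": "map", "output": "label"},
    {"op": "filter", "column": "region"},
    {"op": "filter", "column": "amount_usd"},
]
optimized = push_filters_up(plan)
```

The filter on region moves to the front, so both map steps process fewer rows; the filter on amount_usd stops just after the step that computes that column.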

Full memory hierarchy

Xcalar uses the entire memory hierarchy, comprising DRAM, Storage Class Memory (SCM), and, when available, flash and disk. Users can build clusters within their budget by mixing high-performance DRAM with lower-cost memory options for an optimal price-performance ratio.

Optimized I/O performance

Xcalar has a highly parallel read and write architecture that enables thousands of cores to perform simultaneous read and write operations on disk, network, or streaming data sources. Unlike many systems that are optimized for either reads or writes, Xcalar delivers optimal performance for both operations.

Reliability and fault tolerance

Xcalar runs user code, including user-defined functions (UDFs), parsers, applications, algorithms, and connectors, in containers called Xcalar Processing Units (XPUs). Within an XPU, each custom code module runs in its own address space and jeopardizes neither overall cluster performance nor stability. In case of node failure, Xcalar quickly recovers using session redo logs.

Extensible architecture

Xcalar is language-agnostic and enables users to write UDFs for custom data transformations and for converting data from custom formats. Xcalar Design provides a simple SDK for users to create new extensions that can be installed and shared in Xcalar Design. For example, programmers can extend Xcalar Design by writing JavaScript extensions to create new UI panels and can map the extended functionality to user-defined functions written in Python. Extensions can also invoke UDFs and the Xcalar Processing Unit functionality.

Skew management

Xcalar Design enables users to detect skew, the commonplace imbalances of data placement across nodes, which can cause an uneven utilization of memory, CPU, and network resources. Xcalar detects and diagnoses data skew across all cluster nodes. Xcalar Design’s comprehensive dashboard both details the occurrence of data skew and provides tools for updating the user’s algorithms to avoid skew.
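One common way to quantify skew is to compare each node's row count with the ideal even share. The sketch below is a generic illustration; the function names, the node labels, and the 1.5x threshold are assumptions, not values taken from Xcalar Design.

```python
def skew_report(rows_per_node):
    """Return the ideal rows per node and each node's ratio to that ideal."""
    total = sum(rows_per_node.values())
    if not rows_per_node or total == 0:
        return 0.0, {node: 0.0 for node in rows_per_node}
    ideal = total / len(rows_per_node)
    ratios = {node: count / ideal for node, count in rows_per_node.items()}
    return ideal, ratios

def skewed_nodes(rows_per_node, threshold=1.5):
    """List nodes holding more than `threshold` times the ideal share."""
    _, ratios = skew_report(rows_per_node)
    return sorted(node for node, ratio in ratios.items() if ratio > threshold)

counts = {"node0": 1000, "node1": 1100, "node2": 900, "node3": 5000}
ideal, ratios = skew_report(counts)
hot = skewed_nodes(counts)
```

A ratio well above 1.0 for one node, as for node3 here, usually points to a poorly distributed key, which is the kind of case where the algorithm would need updating.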

Backup and recovery

Xcalar creates redo logs that allow recovery from both partial failures (such as aborted transactions) and cluster failures (such as a power outage). Logs and datasets can be backed up by users on a schedule. Checkpoints are supported, which allows recovery using hot snapshots of data.

Ad hoc analytics/modeling on hundreds of billions of rows

Xcalar Cloud Enterprise enables users to perform interactive analysis on hundreds of billions of rows including applying relational operations, such as join, union, group by, pivot, filter, aggregate, sort and merge.

File format-agnostic

Xcalar Cloud Enterprise works with source data files of any format. It preserves user data in its original format, with no need to convert it into a Xcalar-specific format. Xcalar Data Platform natively supports common file formats, such as CSV, JSON, Excel, XML and Parquet. Xcalar was designed from the ground up to perform relational operations on arbitrary data. Xcalar can work with semi-structured data, such as log files, JSON documents, or IoT data; structured data, such as rows and columns; unstructured data, such as images and social media content; and streaming data sources such as Kafka. New file connectors can be written in Python with a simple SDK.
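A connector for a custom format is typically little more than a parser that turns raw input into records. The sketch below shows that idea as a plain Python generator for a hypothetical key=value log format; it does not reproduce the signature or conventions of Xcalar's SDK.

```python
import io

def parse_kv_lines(stream):
    """Yield one dict per non-empty line of space-separated key=value pairs."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:  # ignore tokens that are not key=value
                record[key] = value
        yield record

sample = io.StringIO("ts=1 level=info msg=start\n\nts=2 level=warn\n")
records = list(parse_kv_lines(sample))
```

Records need not share the same fields, which matches the semi-structured data described above; values are left as strings so that typing can be decided later.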

Lineage and auditability

Dataflows track data from source through each transformation. Dataflows include low-level details, including the path to the data source, intermediate tables created, and the operators applied to these tables. Xcalar Cloud Enterprise’s rich set of features, supported by the dataflows, allows users to drill down to schemas, distribution of data across nodes, and data skew. Dataflows are auditable, and can be both archived as graphics and exported as JSON files.

Enable BI tools to query live data (JDBC and REST)

Xcalar provides JDBC support, which enables integration with BI applications such as Tableau, Qlik, and Power BI. Xcalar virtual tables, data marts or cubes can be queried by any BI tool for visualization and reporting. Users can offload Tableau and Qlik servers when querying multidimensional cubes or data marts. Alternatively, the tables can be exported to databases, data marts or OLAP cubes, providing an end-to-end analytics platform.

Microbatch transactions

Xcalar can handle real-time inserts, modifications and deletions coming in at microsecond intervals while maintaining transactional consistency. Xcalar microbatch dataflows can efficiently process such complex sequences of insert, modify and delete operations on large volumes of data from multiple data sources. Users can either manually schedule microbatch dataflows or initiate them on demand through a RESTful API. Xcalar enables users to view all the transformations performed on their data, including a timeline of inserts and updates. Users can traverse back visually to any point on this timeline.
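The combination of all-or-nothing batches and a traversable timeline can be sketched with immutable snapshots. This is a toy in-memory model with hypothetical names (apply_microbatch, the change record layout), not Xcalar's microbatch engine.

```python
def apply_microbatch(table, batch):
    """Apply one batch of changes to a copy of `table`, all or nothing."""
    updated = dict(table)  # the input snapshot is never mutated
    for change in batch:
        key, kind = change["key"], change["kind"]
        if kind == "insert":
            if key in updated:
                raise ValueError(f"duplicate key {key}")
            updated[key] = change["value"]
        elif kind == "modify":
            if key not in updated:
                raise KeyError(key)
            updated[key] = change["value"]
        elif kind == "delete":
            del updated[key]  # raises KeyError if the key is missing
        else:
            raise ValueError(f"unknown change kind: {kind}")
    return updated

# Each applied batch appends a new snapshot, so earlier points stay reachable.
timeline = [{}]
batches = [
    [{"kind": "insert", "key": "a", "value": 1},
     {"kind": "insert", "key": "b", "value": 2}],
    [{"kind": "modify", "key": "a", "value": 10},
     {"kind": "delete", "key": "b"}],
]
for batch in batches:
    timeline.append(apply_microbatch(timeline[-1], batch))
```

If any change in a batch is invalid, the function raises before returning, so the timeline only ever contains states produced by complete batches.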

Memory optimization

Active data is prioritized and auto-tiered higher in the memory hierarchy when Xcalar processes workloads. As users activate or deactivate sessions, shared memory is used or immediately freed up. Also, virtual tables not immediately needed can be exported to disk and later imported, when needed. Xcalar automatically applies memory optimization where applicable during operationalization, including automatically dropping metadata and moving filter clauses up to execute earlier in the algorithm to minimize data handling downstream. Once operationalized, batch jobs are optimized for end-to-end execution efficiency, and consume a small fraction of the memory utilized in modeling.

Mixed mode operations

Xcalar gives users the ability to partition the processing power across the entire cluster to accommodate different kinds of workloads that may be running concurrently. Xcalar can run multiple batch jobs within SLA, while also running latency-sensitive workloads that enable business users to do ad hoc analytics interactively.

Flexible deployment

Xcalar Data Platform can be deployed in the cloud, such as on Microsoft Azure, Amazon Web Services, or Google Cloud Platform, or on on-premises servers and private clouds. Xcalar Data Platform also supports hybrid deployment.

Security and control

Xcalar Cloud Enterprise provides user- and role-based access control, including integration with Kerberos, Azure AD, LDAP, OAuth, and other custom authentication and user management services. Xcalar allows admins to control what can be shared within Xcalar. Through the dataflow graphs, algorithms and models can be audited and traced back to identify the source data and the owner of the source dataset.

Collaboration

Xcalar is designed for collaboration. Teams of users can download and share their Xcalar workbooks for efficient and easy collaboration, configure datasets to be shared with other users, or read UDFs from a central repository.

Administration and system management

Xcalar Design includes performance tuning dashboards for system administrators. The dashboards advise on how to improve resource utilization, how to meet service-level objectives when administrators operationalize their workloads, and how to manage data skew within a cluster.