This section explains basic concepts you must understand to use Xcalar Design efficiently.
Understanding the purpose of Xcalar Design
Xcalar Design is an HTML5-based visual tool that enables you to interactively and intuitively design algorithms through elementary operations. By manipulating the values in the tables presented in the graphical user interface, you can clean and consolidate your data before performing data analysis. You can also publish data from Xcalar Design to a Jupyter Notebook, an open-source, web-based application, for further data analysis or visualization.
When you use Xcalar Design interactively, it generates a dataflow to show a sequence of operations leading up to each table.
If you use Xcalar in operational mode, you can save dataflows and run them on demand or at scheduled intervals. If you use Xcalar in modeling mode, you can save dataflows but you cannot run them. To run a dataflow saved in modeling mode, use Xcalar Design to download it and then upload it to a Xcalar cluster with an operational mode license.
Xcalar Design can import data that is unstructured or organized in arbitrary formats. The data sources can reside on file systems such as NFS, HDFS, Amazon S3, or a local file system.
Understanding Xcalar Design login
Each Xcalar Design user can have one login at a given time. Closing the browser window does not automatically log you out of Xcalar Design. If you are logged in to Xcalar Design and you try to log in again (for example, from another browser tab), a message informs you that you are already logged in elsewhere. You can either go back to the login screen or continue the login, which causes Xcalar Design to terminate your other connection.
Login names are case-insensitive.
Shared vs. shared-nothing storage for Xcalar
Xcalar's True Data in Place algorithms enable you to store source data on shared or shared-nothing storage, without having to move your data in advance to attain maximum processing speed.
Comparison between Xcalar and other architectures
Xcalar's unique architecture eliminates the need for data sharding, partitioning, or placement for node affinity, which is commonly required by other data analytics tools using software frameworks such as Hadoop and Spark. In these frameworks, you typically shard the data for locality of reference to a node where a particular algorithm is running. The placement of data occurs either at the start or when a MapReduce job begins to execute. Without proper placement, you cannot achieve optimal data processing speed.
Xcalar's nodes, however, read data in parallel and process it optimally irrespective of locality. Furthermore, while processing the data, Xcalar does not shuffle data as a MapReduce job would. Data has mass, and movement of terabytes of data for locality of reference at the start of processing, or later, during the shuffle phase, incurs huge processing and performance costs. Xcalar's capability to keep source files in their original location from beginning to end greatly simplifies deployment and improves data processing performance.
Examples of Xcalar using all cores for parallel I/O
In this example, a 4-node cluster with a total of 32 cores processes files in parallel without requiring the files to be moved from one storage space to another or redistributed among nodes. When you use Xcalar Design to import 1,024 files, the nodes of the cluster read the files in batches of 32, utilizing the full parallel I/O bandwidth from all the cores. Files can be read asynchronously or synchronously; asynchronous reads may provide higher throughput.
The following figure illustrates the cluster processing 1,024 files.
Suppose you use the same cluster to import 1,025 files. Again, the 32 cores read 32 files at a time. The last file is processed in its entirety by only one core.
Similarly, if you only need to import one file, only one core is used to process the file.
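The arithmetic behind these three cases can be sketched in a few lines of Python. This is a simplified model, not Xcalar's actual scheduler: it assumes that each core reads exactly one whole file per round and that all rounds run back to back. The function name `read_rounds` is hypothetical.

```python
def read_rounds(num_files: int, num_cores: int = 32) -> list[int]:
    """Return how many files are read in each round.

    Simplified model (an assumption): every core reads one whole
    file per round, so a round handles at most num_cores files.
    """
    rounds = []
    remaining = num_files
    while remaining > 0:
        batch = min(remaining, num_cores)
        rounds.append(batch)
        remaining -= batch
    return rounds


for n in (1024, 1025, 1):
    rounds = read_rounds(n)
    # Print: file count, number of rounds, files read in the last round
    print(n, len(rounds), rounds[-1])
```

Under this model, 1,024 files take 32 full rounds, 1,025 files take 33 rounds with a single core busy in the last one, and one file takes one round on one core.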
In cases where only one core processes a source file, Xcalar does not arbitrarily break the file into chunks to parallelize I/O as that would violate the sequential integrity of the data. Remember that each core reads a file in its entirety; the core never reads a partial file. Therefore, if you want to take advantage of the bandwidth of all cores, you must split the source file into multiple smaller files in a way that can be understood by your logic when you perform modeling later on.
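One way to prepare such a split outside Xcalar is sketched below. It assumes a CSV file with a single header line and exactly one record per line (no quoted fields that contain line breaks). The function name `split_csv` and the output naming scheme are hypothetical. Records stay in their original order, and each output file repeats the header so that it can be imported on its own.

```python
import math
from itertools import islice
from pathlib import Path


def split_csv(src: str, out_dir: str, parts: int) -> list[Path]:
    """Split a CSV file into up to `parts` files of contiguous records."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    src_path = Path(src)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # First pass: read the header and count the data lines.
    with src_path.open(newline="") as f:
        header = f.readline()
        total = sum(1 for _ in f)
    if total == 0:
        return []

    per_file = math.ceil(total / parts)
    written = []
    # Second pass: copy consecutive blocks of lines into numbered files.
    with src_path.open(newline="") as f:
        f.readline()  # skip the header; it is rewritten per output file
        for index in range(parts):
            chunk = list(islice(f, per_file))
            if not chunk:
                break
            target = out / f"{src_path.stem}_{index:04d}{src_path.suffix}"
            with target.open("w", newline="") as dst:
                dst.write(header)
                dst.writelines(chunk)
            written.append(target)
    return written
```

For the 32-core cluster in the example, calling `split_csv` with `parts=32` (or a multiple of 32) would give every core a file to read. The zero-padded index in each file name keeps the original record order recoverable during modeling.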
You can create workbooks in Xcalar Design, which can be regarded as files. Within a workbook, you can create as many worksheets as you want. Worksheets in the same workbook usually contain data related to the same project. The worksheet displayed on your screen is the active worksheet.
At any time, only one workbook can be active. You can deactivate or activate a workbook as desired.
The relationship between a worksheet and a workbook is similar to the relationship between a Microsoft Excel worksheet and workbook.
The following sample screen shows a Worksheet window for a workbook titled MyNewWorkbook.
Understanding data sources, datasets, and tables
The raw data you want to analyze can be stored in one file, or multiple files in the same directory or different directories. The term data source refers to the file, files, or directory containing the raw data.
After you import data from a data source, you can pull fields of interest from the dataset to create tables in a Xcalar Design worksheet. More fields from the dataset can be added to an active table at any time.
You can manipulate and transform data within a table (for example, by sorting or filtering) or combine multiple tables, which can be in different worksheets, to create a new table (for example, by joining). Use data operations to implement an algorithm that you design for deriving meaning from your data. With Xcalar's True Data in Place architecture, data is imported and analyzed without any ETL (Extract, Transform, and Load).
Each table is created with a name unique in the workbook. You can move tables between worksheets but not workbooks.
Effects of a Xcalar Compute Engine restart on your workbook
After Xcalar Compute Engine is restarted, all workbooks become inactive. This means that you must re-activate a workbook to resume your work.
Re-activating a workbook in this scenario requires that Xcalar Design can access all data sources for creating datasets used in the workbook. The path names to the data sources must be the same as before so that Xcalar can rebuild the datasets in the same way as it did when you imported them the first time.
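Because re-activation depends on the data sources being reachable at their original paths, it can help to verify those paths beforehand. The sketch below assumes that you keep your own list of the source paths and that the sources reside on a locally mounted file system such as NFS or a local disk. It does not cover HDFS or Amazon S3. The function name `missing_sources` is hypothetical and is not part of Xcalar.

```python
from pathlib import Path


def missing_sources(paths: list[str]) -> list[str]:
    """Return the paths that no longer exist on the local file system."""
    return [p for p in paths if not Path(p).exists()]
```

An empty result means that every listed path is still present. Any path that is returned must be restored under its original name before you re-activate the workbook.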
Data formats supported
Xcalar Design natively supports CSV, JSON, Excel, and raw text, and can support other formats using UDFs (user-defined functions).
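To illustrate the kind of logic a format-handling UDF typically contains, the sketch below parses a made-up line format (`key=value` pairs separated by semicolons) into records. The format, the function name `parse_key_value_lines`, and its signature are all assumptions for illustration. The signature that Xcalar expects from an import UDF is not described in this section.

```python
import io
from typing import Iterator


def parse_key_value_lines(stream: io.TextIOBase) -> Iterator[dict]:
    """Yield one dict per non-empty line of 'key=value;key=value' text."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = {}
        for pair in line.split(";"):
            key, _, value = pair.partition("=")
            record[key.strip()] = value.strip()
        yield record


sample = io.StringIO("id=1;name=alpha\nid=2;name=beta\n")
for row in parse_key_value_lines(sample):
    print(row)
```

Yielding one record at a time keeps memory use flat regardless of file size, which suits a function that reads each source file from start to finish.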
How Xcalar Design saves your work
Xcalar Design saves the results of all data operations as they are being completed. For example, after you filter a column, the resultant table is saved automatically. You do not have to click Save.
If your work only changes the appearance of a table or a table column, Xcalar Design automatically saves your work in these situations:
- It saves your work every two minutes. (You can change the time interval in the General Settings window as described in Changing the environment settings for Xcalar Design.)
- It saves your work when you sign out. However, before this automatic save, if you refresh your browser, your browser displays a message warning about unsaved changes. You must click Save to avoid losing the changes. The following screenshot shows the location of Save.
How to tell if there are unsaved changes
The following indicators show that there are unsaved changes in your workbook:
- An asterisk displayed next to Save at the bottom of your browser window. If you hover over Save, you can see a tooltip showing the time of the last save.
- A blue dot on the Xcalar icon in the browser tab.
To manually save changes, click Save as shown in the following screenshot.
Understanding what entities are shared by Xcalar users
Because multiple users can use the same Xcalar cluster, the following entities are shared among Xcalar users:
- Batch dataflow, including its associated parameters and schedule.
- Dataset. NOTE: While all users can use the same dataset, only the dataset creator can delete the dataset.
- Export target.
- UDF (user-defined function).
In addition, memory is shared. The amount of memory used by one user affects the amount of memory available for other users.
The following list describes the entities that are not shared among Xcalar users:
In addition, the general settings that control Xcalar Design user interface elements are not shared.
Understanding the license key
To use all features offered by Xcalar Design, you must enter a valid license key during installation. Without a valid license, you cannot perform modeling by using operations such as filtering, finding aggregates, sorting, and joining. You can, however, import and export data.
If you cannot perform operations due to an invalid license, contact your Xcalar administrator as soon as possible. To ensure uninterrupted services, make sure that you have a valid license at all times.
If you are an administrator, you can update the license in the Setup panel of the Monitor. For more information about the Monitor, see Using Setup (Xcalar admin only).
Xcalar Compute Engine product version and license information
To display Xcalar product versions and the license expiration date, follow these steps:
- Click in the upper right corner of the Xcalar Design window.
- In the drop-down menu, click About to display a modal window that lists product versions, license expiration date, and copyright information.
If the License Key Expiration is Unlicensed, the cluster does not have a valid license key, and you can perform only a limited number of operations.