This knowledge base article discusses batch dataflows and how to create a cascading batch dataflow.
Understanding batch dataflows
The Xcalar batch dataflow feature allows you to run a set of operations on all source datasets within the dataflow. You can run these operations on demand or schedule them to run at a specified time. As the data in your data source changes, running the batch dataflow periodically lets you develop new insights.
There are cases where the resulting table of the batch dataflow can be useful as input for other sets of operations by other users. For these cases you can create a cascading batch dataflow. A cascading batch dataflow can be created from either the first batch dataflow or subsequent cascading batch dataflows that were created from the first batch dataflow, as described in the examples below:
- Creating a second batch dataflow from the first batch dataflow.
Users can develop another set of operations from the resulting table of the first batch dataflow and save it as a second batch dataflow. Because the second batch dataflow requires the results of the first, the two batch dataflows remain ordered, or cascaded. For example:
- Creating additional batch dataflows from the second batch dataflow.
Users can create and develop a third set of operations from the second batch dataflow. These batch dataflows depend on the results of the second batch dataflow and create cascading batch dataflows, such as in the following example:
The following example shows how a fourth set of operations, developed by another user, might also rely on the results of the first batch dataflow:
The above examples illustrate the power of cascading batch dataflows. Cascading batch dataflows allow multiple users to develop multiple sets of operations from the original source datasets.
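The dependencies in the examples above form a directed acyclic graph: a batch dataflow must run before any dataflow that consumes its resulting table. The following Python sketch illustrates that ordering with the standard library `graphlib` module (Python 3.9 or later). The dataflow names `df1` through `df4` and the `run_order` helper are hypothetical and are not part of any Xcalar API.

```python
from graphlib import TopologicalSorter

# Hypothetical dataflow names. Each key maps to the set of dataflows
# whose resulting tables it needs as input.
DEPENDENCIES = {
    "df1": set(),     # first batch dataflow, reads the source datasets
    "df2": {"df1"},   # second batch dataflow, built on the result of df1
    "df3": {"df2"},   # third set of operations, built on the result of df2
    "df4": {"df1"},   # another user's dataflow, also built on df1
}


def run_order(dependencies):
    """Return one valid execution order: each dataflow after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(DEPENDENCIES)
```

Any order this returns places `df1` first, and `df3` after `df2`. The relative position of `df4` is not fixed, because it depends only on `df1`.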
Xcalar maintains the data lineage across the entire workflow, which allows you to trace the ancestry of any field.
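As a rough illustration of what tracing the ancestry of a field means, the Python sketch below walks a table of lineage records back to the source fields. The field names and the `LINEAGE` structure are invented for this example. The sketch does not describe how Xcalar stores or exposes lineage.

```python
# Hypothetical lineage records: each field maps to the fields it was
# computed from. Source fields map to an empty list.
LINEAGE = {
    "price": [],
    "quantity": [],
    "revenue": ["price", "quantity"],   # e.g. a map in the first dataflow
    "revenue_usd": ["revenue"],         # e.g. a map in a cascading dataflow
}


def ancestry(field, lineage):
    """Return every field that `field` directly or indirectly derives from."""
    seen = []
    stack = list(lineage[field])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(lineage[parent])
    return seen


ancestors = ancestry("revenue_usd", LINEAGE)
```

Here `ancestors` contains `revenue`, `price`, and `quantity`, which crosses the boundary between the first dataflow and the cascading one.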
Creating a cascading batch dataflow
This section describes the steps to create a cascading batch dataflow.
User A develops the initial batch dataflow and then passes it to User B for additional modeling.
To create a cascading batch dataflow:
- User A logs in to Xcalar Design and creates a series of modeling operations from the source data.
- From the Dataflow graph, User A clicks Create batch dataflow.
The DATAFLOW panel appears.
- In the Batch Dataflow Name field, User A enters a name for the batch dataflow.
- In the Columns to Export section, User A selects the Select All checkbox.
- User A clicks CREATE, which creates a batch dataflow.
Figure 1: Xcalar Design - specifying the creation of a batch dataflow
Figure 2: Xcalar Design - specifying the name and exported columns for the batch dataflow
The batch dataflow is added to the list of other batch dataflows available on the cluster.
Figure 3: Xcalar Design - list of batch dataflows existing in the cluster
After creating the batch dataflow, User A passes the name of the batch dataflow to User B. User B then adds additional operations to the batch dataflow and creates a second batch dataflow from the results. This second batch dataflow is known as a cascading batch dataflow.
Adding operations to an existing batch dataflow
To add operations to an existing batch dataflow:
- User B logs in to Xcalar Design, runs the first batch dataflow created by User A, and enters a name for the resultant table.
Figure 4: Xcalar Design - specify the exported table name and run the batch dataflow
The batch dataflow creates a table with the specified name in the worksheet.
Figure 5: Completion of the batch dataflow and creation of the resultant table
- User B builds upon the exported table by creating additional operations.
Figure 6: Adding additional operations (a filter and a map)
- From the updated Dataflow graph, User B clicks Create batch dataflow.
The DATAFLOW panel appears.
- In the Batch Dataflow Name field, User B enters a name for the batch dataflow.
- In the Columns to Export section, User B selects the Select All checkbox.
Figure 7: Xcalar Design - specifying the name and exported columns for the cascading batch dataflow
- User B clicks CREATE, which creates a second batch dataflow.
The second batch dataflow is added to the list of other batch dataflows.
- User B runs the second (cascading) batch dataflow and directs its output into another table in the worksheet.
Figure 8: Xcalar Design - specify the exported table name and run the batch dataflow
Running the cascading batch dataflow creates a table with the specified name.
Figure 9: Completion of the batch dataflow and creation of the resultant table
The resultant table is added to the worksheet.
Figure 10: Resultant table from the batch dataflow added to the worksheet
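The procedure above amounts to composing two sets of operations: User B's batch dataflow starts from the table produced by User A's batch dataflow and then applies a filter and a map. The Python sketch below models that composition with plain lists of dictionaries. The column names, the filter conditions, and both function names are assumptions made for illustration; in Xcalar Design the same steps are performed through the interface described above.

```python
def first_dataflow(source_rows):
    """User A's hypothetical modeling step: keep rows with a positive amount."""
    return [row for row in source_rows if row["amount"] > 0]


def second_dataflow(source_rows):
    """User B's cascading dataflow: run the first, then add a filter and a map."""
    table = first_dataflow(source_rows)  # resultant table of User A's dataflow
    table = [row for row in table if row["region"] == "west"]  # added filter
    # Added map: derive a new column and export every column.
    return [{**row, "amount_cents": row["amount"] * 100} for row in table]


source = [
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "west", "amount": -1},
]
result = second_dataflow(source)
```

Because `second_dataflow` calls `first_dataflow` on the same source rows, rerunning it after the source data changes picks up the new data through both stages.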