Using batch dataflows
This section describes tasks related to batch dataflows, except for the following:
- Scheduling, which is described in Scheduling batch dataflows.
- Parameterizing operations and the dataset, which is described in Parameterizing operations.
Displaying batch dataflows
Follow these steps to display batch dataflows:
- Click the icon for displaying batch dataflows to show a list of them. All batch dataflows, including those created by other users, are displayed.
- Click the name of the batch dataflow in the list to display the dataflow graph.
The icon shown at the end of a batch dataflow graph represents the result of running the batch dataflow. Initially, the result is a folder named after the resultant table in the dataflow. Data generated by the dataflow is stored in CSV files, which are saved in this folder.
Actual result of running a batch dataflow
The actual result of running a batch dataflow depends on your selection under Advanced Export Options in the graph before you run the batch dataflow:
- Export to export target: The export folder shown in the batch dataflow graph is saved to the default export target or to a user-defined target that you specify in the Parameterize Operation modal window. For information about parameterizing, see Renaming the export folder shown in a batch dataflow graph. By default, the table is exported to a folder at the default export target.
IMPORTANT: The name of the export file must not be a duplicate. For example, if running the batch dataflow for the first time creates a file named export-airlines1#L811.csv, re-running the batch dataflow fails because the dataflow attempts to create an export file with the same name. To avoid duplicating names, before re-running a dataflow, either rename the existing export file (for example, to export-airlines00#L811.csv) or parameterize the name for the export file to be created.
- Export as a Xcalar table: This option exports the table created by the batch dataflow to a table that is placed in your active worksheet. You must give the exported table a name that does not conflict with any existing table name.
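The requirement above that export file names must not be duplicated can be checked before a re-run. The following Python sketch is illustrative only and is not part of Xcalar: it assumes the export target is a directory that the script can read, and the function name next_free_export_name is hypothetical. It returns the requested name if it is unused, or a numbered variant otherwise.

```python
from pathlib import Path


def next_free_export_name(export_dir: Path, file_name: str) -> str:
    """Return file_name if it is unused in export_dir, else a numbered variant.

    Assumption: the export target is a plain directory visible to this script.
    """
    requested = Path(file_name)
    stem, suffix = requested.stem, requested.suffix
    name = file_name
    counter = 1
    # Append "-1", "-2", ... before the extension until the name is free.
    while (export_dir / name).exists():
        name = f"{stem}-{counter}{suffix}"
        counter += 1
    return name
```

The returned name could then be used when renaming the existing file or as the value of a parameter in the export file name.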
Running a batch dataflow created by Xcalar in modeling mode
You can create a batch dataflow on a cluster running Xcalar in modeling mode, download it to a dataflow file, and then upload it to a cluster running Xcalar in operational mode. For information about moving a batch dataflow from one cluster to another, see Downloading and uploading a batch dataflow.
Before running a batch dataflow created on another cluster, be sure to perform these tasks:
- Verify that the data source path is correct. Be aware that the cluster that creates the batch dataflow and the cluster to which the batch dataflow is uploaded might have different mounted directories.
- Verify that the export operation is parameterized appropriately for your cluster.
- (Optional) If the cluster where the dataflow originated also runs Xcalar in operational mode, check whether the batch dataflow contains parameters. If it does, assign values to the parameters.
Understanding tables exported from batch dataflows
The table exported to the worksheet from a batch dataflow works in the same way as any other table. It has its own dataflow graph, from which you can create another batch dataflow.
However, some differences exist between the table created through a batch dataflow and other tables, as described below:
- All columns are derived columns. Column names do not have a prefix.
- You cannot trace the lineage of the data in this table. For example, you cannot determine the source of the data in the table or whether the data type of a column has been changed.
Renaming the export folder shown in a batch dataflow graph
You can see the export folder name below the last icon of the batch dataflow graph. You can change the folder name by following these steps:
- Click the last icon in the batch dataflow graph. Then click Create parameterized operation in the pop-up menu.
In the Parameterize Operation modal window, follow one of these steps to specify the export folder name:
- Enter an export folder name that does not conflict with any existing folder name at the export target.
- Specify an export target that is different from the target that contains the existing export folder.
Both the export folder name and the export target can be parameterized. For example, you can create a parameter named suffix and insert the parameter into the export folder name or export target name. For more information about parameterization, see Parameterizing operations.
NOTE: The parameterized target name must refer to a target that has been created. For example, if the parameterized target name is myTarget1 but no target under that name exists, an error message is displayed.
In the dataflow graph, the name under the last icon changes to the new folder name if one is specified.
IMPORTANT: If you choose to parameterize the export folder name or target name, use a parameter that you created. Do not use the N parameter, which is a system parameter reserved for scheduled batch dataflows. If the export folder or target name contains a system parameter, a warning notifies you that the parameter is disregarded. If you click CONFIRM to proceed, the batch dataflow runs with no parameters in the export folder and target names.
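To make the parameterization rules above concrete, the following Python sketch substitutes user-defined parameters into an export folder name and reports any reserved system parameter it finds. It is a model for illustration, not Xcalar code. The angle-bracket placeholder syntax, the function name resolve_export_name, and the choice to strip only the reserved placeholder are all assumptions.

```python
import re

# The text above names N as a system parameter reserved for scheduled dataflows.
RESERVED_PARAMETERS = {"N"}

# Assumption: a parameter appears in a name as <parameterName>.
PLACEHOLDER = re.compile(r"<([A-Za-z_][A-Za-z0-9_]*)>")


def resolve_export_name(template: str, values: dict[str, str]) -> tuple[str, list[str]]:
    """Return (resolved name, reserved parameters that were disregarded)."""
    disregarded: list[str] = []

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in RESERVED_PARAMETERS:
            # Assumption: a disregarded system parameter contributes no text.
            disregarded.append(name)
            return ""
        if name not in values:
            raise KeyError(f"no value assigned to parameter {name!r}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template), disregarded
```

For example, resolving "export-airlines-<suffix>" with a value assigned to suffix yields a folder name that changes whenever the value changes, which is one way to avoid name conflicts between runs.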
The following screenshot illustrates how you can specify the export folder name and export target name.
Running a batch dataflow
Follow these steps to run a batch dataflow:
In the batch dataflow graph, click Advanced Export Options to specify the name of the file system or table to which the result of the batch dataflow is exported. (If none is specified, a default name is provided.)
The following list describes the naming conventions:
- The name is case sensitive.
- Allowed characters are A-Z, a-z, 0-9, hyphen (-), underscore (_), and space.
- The first character must be a letter.
- The length must not exceed 255 characters.
- The name must not be a duplicate of an existing table name in the same workbook.
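The naming conventions in the list above can be expressed as a small validator. This Python sketch is for illustration and is not a Xcalar API: the function name export_name_problems is hypothetical, "letter" is assumed to mean an ASCII letter, and the set of existing table names is assumed to be supplied by the caller.

```python
import re


def export_name_problems(name: str, existing_names: set[str]) -> list[str]:
    """Return a list of rule violations; an empty list means the name is valid."""
    problems = []
    # Assumption: "letter" means an ASCII letter, A-Z or a-z.
    if not re.match(r"[A-Za-z]", name):
        problems.append("first character must be a letter")
    if re.search(r"[^A-Za-z0-9_\- ]", name):
        problems.append("contains a character other than A-Z, a-z, 0-9, '-', '_', or space")
    if len(name) > 255:
        problems.append("longer than 255 characters")
    # Names are case sensitive, so only an exact match counts as a duplicate.
    if name in existing_names:
        problems.append("duplicates an existing table name")
    return problems
```

Because names are case sensitive, airlines_2019 and Airlines_2019 are treated as different names and do not conflict.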
Click the icon for running the batch dataflow. While the batch dataflow is in progress, the icon turns into a spinner. You can click the spinner to terminate the execution of the batch dataflow.
The following list describes how the batch dataflow graph changes when a batch dataflow is in progress or is finished:
- While the batch dataflow is in progress, a tooltip shows the status of each table that has been created or is being created.
- For each table being created, the icon is yellow, and a progress bar is displayed.
- After a table is created, the dataset and table icons turn green. You can hover over each icon to display the time taken to complete the operation and the number of rows in the table.
- If the batch dataflow fails to complete, a red dataset or table icon is displayed to show where the problem occurs.
The following screenshot shows an example of a batch dataflow.
You can also run a batch dataflow according to a pre-defined schedule. For information about scheduling a batch dataflow, see Scheduling batch dataflows.
Downloading and uploading a batch dataflow
You can download a batch dataflow to your client computer, which is the system where the browser for Xcalar Design is running. The download process creates a transportable .tar.gz file, also called a dataflow file. After the .tar.gz file is downloaded, you can upload it to any Xcalar cluster from your client computer with a Xcalar Design browser session.
The download location is determined by a browser setting. For example, in your Chrome browser, display the advanced settings and then go to the Downloads section to specify a path for storing all downloaded files.
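Because the downloaded dataflow file is a .tar.gz archive, you can list its members with standard tools before uploading it. The following Python sketch uses only the standard tarfile module. The function name list_dataflow_file is hypothetical, and the sketch makes no assumption about what the archive contains.

```python
import tarfile


def list_dataflow_file(path: str) -> list[str]:
    """Return the names of the members stored in a .tar.gz dataflow file."""
    # Mode "r:gz" accepts only gzip-compressed tar archives, matching the
    # .tar.gz format; any other file raises tarfile.ReadError.
    with tarfile.open(path, mode="r:gz") as archive:
        return archive.getnames()
```

A file that fails to open this way is not a valid .tar.gz archive and is unlikely to be accepted as a dataflow file.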
Understanding the dataflow file contents
The dataflow file is a portable representation of your data model. It contains all the operations and parameterization, which constitute the algorithm designed through your modeling process. It does not retain the values that pertain to the actual running of the dataflow. You must supply those values after uploading the dataflow. The following sections describe specifically the types of information that are included or excluded in a dataflow file.
What is included in the dataflow file
The dataflow file includes the parameters used in the operations that precede the export operation (the parameters themselves, not their values).
What is not included in the dataflow file
The dataflow file does not include the following information:
- The values for any parameters used in the dataflow. If you upload the dataflow, you must assign new values to the parameters used in the dataflow.
- Parameters used in the export operation. If the original batch dataflow exports to a user-defined target or a parameterized export folder, you cannot obtain the target and export folder information from the dataflow file. If you run the uploaded dataflow, it exports to a default folder at the default target location. You must parameterize the export operation in the uploaded batch dataflow so that the dataflow exports the result to the desired location.
- Schedule defined for the dataflow. After you upload the dataflow, to run it regularly, you must re-define the schedule. The schedule previously associated with the original dataflow is not retained during the download.
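The exclusions listed above translate into a short checklist to work through after an upload. The following Python sketch models that checklist. The UploadedDataflow record and the remaining_setup function are hypothetical names introduced for illustration and do not correspond to Xcalar types or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class UploadedDataflow:
    """Hypothetical summary of an uploaded dataflow's state (not a Xcalar type)."""
    parameters: set[str]
    assigned_values: dict[str, str] = field(default_factory=dict)
    export_parameterized: bool = False
    schedule_defined: bool = False


def remaining_setup(flow: UploadedDataflow, wants_schedule: bool = False) -> list[str]:
    """List the tasks still to do before running the uploaded dataflow."""
    todo = []
    # Parameter values are not stored in the dataflow file.
    for name in sorted(flow.parameters - set(flow.assigned_values)):
        todo.append(f"assign a value to parameter {name}")
    # Export parameterization is not stored either; the default is used otherwise.
    if not flow.export_parameterized:
        todo.append("parameterize the export operation")
    # The schedule is not retained during the download.
    if wants_schedule and not flow.schedule_defined:
        todo.append("re-define the schedule")
    return todo
```

An empty list means that nothing from the original cluster is still missing. Whether the data source path is valid on the new cluster must still be verified separately.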
To download a batch dataflow, click the download icon for the selected batch dataflow in the batch dataflow list.
The following screenshot shows the icons for downloading and uploading batch dataflows.
You can upload a batch dataflow that was downloaded from the same Xcalar cluster or any other Xcalar cluster. If the dataflow file has a UDF associated with it (for example, a UDF is used in one of the operations in the dataflow), the UDFs of the cluster are affected in one of the following ways:
- If the module name of the UDF does not conflict with any module name on the cluster, the module containing the UDF is also uploaded to the cluster. For example, if the dataflow file uses the UDF named calendar:getWeekday, the calendar module is added to your cluster when you upload the dataflow file.
- If the module name of the UDF conflicts with a module name on the cluster, the result depends on whether you choose to overwrite the module on the cluster with the module in the dataflow file:
IMPORTANT: You cannot revert a module after it is overwritten by a dataflow file upload. Enable the overwriting only after you verify that you no longer need the contents of the existing module that has the same name as that in the dataflow file.
- If you choose to overwrite the module on the cluster, all contents of the module on the cluster are replaced by the contents of the module in the dataflow file. For example, if the dataflow file uses the UDF named calendar:getWeekday, the module named calendar on the cluster is overwritten by the module from the dataflow file.
- If you do not choose to overwrite the module on the cluster, the same function on the cluster is used when you run the dataflow from the dataflow file. If the function by the same name does not exist in the module, the attempt to run the dataflow fails.
For example, if the dataflow file uses the UDF named calendar:getWeekday and the cluster also has a UDF by the same name, the calendar:getWeekday UDF on the cluster is invoked when you run the dataflow from the dataflow file. However, if the calendar module on the cluster does not contain a getWeekday function, you cannot run the dataflow. You must copy the function to the module on the cluster before trying to run the dataflow again.
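The module conflict rules described above can be summarized in a small model. In this Python sketch, each set of modules is represented as a dictionary that maps a module name to the set of function names it contains. This representation and the function names upload_modules and can_run are assumptions made for illustration, not Xcalar APIs.

```python
def upload_modules(cluster: dict[str, set[str]],
                   dataflow_file: dict[str, set[str]],
                   overwrite: bool = False) -> dict[str, set[str]]:
    """Return the cluster's modules as they would be after an upload."""
    result = {name: set(functions) for name, functions in cluster.items()}
    for name, functions in dataflow_file.items():
        # A module with no conflicting name is added. A conflicting module is
        # replaced as a whole, and only when overwriting is enabled.
        if name not in result or overwrite:
            result[name] = set(functions)
    return result


def can_run(udf: str, cluster: dict[str, set[str]]) -> bool:
    """Check whether a UDF written as module:function exists on the cluster."""
    module, function = udf.split(":", 1)
    return function in cluster.get(module, set())
```

For example, if the cluster's calendar module lacks getWeekday and overwriting is disabled, can_run("calendar:getWeekday", ...) is false after the upload. With overwriting enabled it is true, but any other functions that were in the cluster's calendar module are gone.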
To upload a dataflow file, follow these steps:
- Click the icon for uploading a dataflow file.
- In the Upload Dataflow window, specify the following information:
- The location of the dataflow file, which must be in the .tar.gz format. You can click BROWSE to navigate to the location and select the file.
- The name of the dataflow, if you want to change it.
- Whether the UDFs in the dataflow file overwrite the ones that exist on the cluster if these UDFs have the same module name.
IMPORTANT: By default, the UDF modules are not overwritten. Xcalar strongly recommends that you back up your modules if you decide to enable overwriting. Without a backup copy, you cannot restore the contents of a module after it is overwritten. If you overwrite a UDF that other workbooks depend on, those workbooks cannot be activated.
- Click UPLOAD.
Saving or displaying a batch dataflow graph as an image file
You can save a batch dataflow graph as an image (for example, to include the image in a presentation or to share the image with others). The image is saved in the PNG format at the download location, which is determined by your browser setting.
Alternatively, you can display the dataflow graph as a PNG image in a new tab of your browser.
Right-click anywhere in the batch dataflow graph to display a menu. The menu options enable you to save the image or display the image in a new browser tab.