Understanding and creating batch dataflows

Xcalar Design provides built-in functions for cleansing data and transforming it into a useful format. In addition, you can use UDFs to extend Xcalar Design to suit your data analysis methodology. Executing these functions interactively enables you to see results almost instantaneously. However, interactive execution operates on a sample of your data and does not show the results for the entire data source. To gain insight from the entire data source, you must run a batch dataflow.

Understanding dataflows

When you use Xcalar Design to design an algorithm for data analysis, a dataflow is created automatically. It consists of the elementary functions executed and the tables created while you use Xcalar Design interactively for modeling. You can view the dataflows leading up to each active table so that you can visually trace the origin and movement of data over time. For information about determining data lineage, see the description about showing the schema for a table in Temporary and active tables.

EXAMPLE: You pull three columns from your data source to create a table and use the Smart Type Casting feature to change the data type, and then you perform a Filter operation on one column. The dataflow shows that the table undergoes one Map operation to change the data type and one Filter operation, and each operation results in a temporary table. The active table, named airlines2#h9838, is shown as the last table in the dataflow. The following screenshot illustrates the dataflow graph displayed in Xcalar Design for airlines2#h9838.

NOTE: The table icons in a dataflow graph are usually blue. A red table icon represents a table containing only erroneous rows. For more information about erroneous rows, see Creating a table with only erroneous rows.
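The Map-then-Filter example above can be modeled as a chain of tables, each linked to the table it was derived from. The following is a minimal sketch of that idea, not an Xcalar API: the `Table` class, the `lineage` function, and the table names other than airlines2#h9838 are hypothetical, and the sketch assumes the Filter result is the active table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Table:
    """One node in a dataflow graph (hypothetical model, not an Xcalar type)."""
    name: str
    operation: str                    # operation that produced this table
    parent: Optional["Table"] = None  # table this one was derived from
    active: bool = False              # False means a temporary table


def lineage(table: Table) -> List[Table]:
    """Follow parent links back to the first table and return oldest-first."""
    chain = []
    node = table
    while node is not None:
        chain.append(node)
        node = node.parent
    return list(reversed(chain))


# Table names below are invented for illustration, except airlines2#h9838.
created = Table("created_table", "Create from data source")
casted = Table("map_result", "Map", parent=created)
filtered = Table("airlines2#h9838", "Filter", parent=casted, active=True)

for t in lineage(filtered):
    status = "active" if t.active else "temporary"
    print(f"{t.operation} -> {t.name} ({status})")
```

Walking the parent links from the active table is one simple way to represent what the dataflow graph shows visually: every operation and intermediate table that led to the final result.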

Understanding batch dataflows

While a dataflow graph enables you to visually trace data lineage, you cannot re-run the sequence of operations as shown in the graph. To do so, you must save the dataflow under a name. A saved dataflow is called a batch dataflow. It is shared among all users on the cluster.

Batch dataflows offer these advantages:

  • You can run the set of operations in the dataflow on demand or on a specified schedule. As the data in your data source changes, you can develop insights from your data source over time.
  • The set of operations is run against the entire data source.
  • Because the operations in a dataflow are automated, you can avoid human errors that might happen if you had to manually execute the operations.
  • The batch dataflow is permanent in the sense that the operations can be performed even after you drop a temporary table in the dataflow. For example, if you drop a temporary table to release memory, you can still create a batch dataflow from the dataflow that contained the dropped table.

    NOTE: Although you can see the icon for the dropped temporary table in the batch dataflow graph, the table is not restored to the workbook, and you cannot access it in Xcalar Design.
  • If you drop an active table, the dataflow associated with the table is also removed from Xcalar Design. Table dropping cannot be undone, but if you have created a batch dataflow for this table, you can run the dataflow at any time to re-create the table.
  • You can export selected columns from the table or the entire table created by a batch dataflow to another application. Alternatively, you can export the result of the batch dataflow to a table, which automatically appears in your worksheet.
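The difference between interactive modeling and a batch run can be illustrated with a short sketch. It assumes a saved dataflow is simply a named, ordered list of operations that can be replayed against any input. Everything here is hypothetical and is not how Xcalar Design is implemented: the function names, the `ArrDelay` column, and the sample rows are invented for illustration.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Operation = Callable[[List[Row]], List[Row]]


def cast_delay_to_int(rows: List[Row]) -> List[Row]:
    """Map step: convert the (hypothetical) ArrDelay column to an integer."""
    return [{**r, "ArrDelay": int(r["ArrDelay"])} for r in rows]


def keep_delayed(rows: List[Row]) -> List[Row]:
    """Filter step: keep only rows with a positive delay."""
    return [r for r in rows if r["ArrDelay"] > 0]


# Registry of saved dataflows, keyed by name (an assumption of this sketch).
saved_dataflows: Dict[str, List[Operation]] = {}


def save_dataflow(name: str, operations: Iterable[Operation]) -> None:
    """Store the ordered operations under a name so they can be re-run."""
    saved_dataflows[name] = list(operations)


def run_dataflow(name: str, source_rows: Iterable[Row]) -> List[Row]:
    """Replay every saved operation, in order, against the given rows."""
    rows = list(source_rows)
    for operation in saved_dataflows[name]:
        rows = operation(rows)
    return rows


full_source = [
    {"Carrier": "AA", "ArrDelay": "12"},
    {"Carrier": "UA", "ArrDelay": "-3"},
    {"Carrier": "DL", "ArrDelay": "45"},
    {"Carrier": "WN", "ArrDelay": "0"},
]
sample = full_source[:2]  # stands in for the sample seen while modeling

save_dataflow("delayed_flights", [cast_delay_to_int, keep_delayed])
sample_result = run_dataflow("delayed_flights", sample)
batch_result = run_dataflow("delayed_flights", full_source)
print(len(sample_result), len(batch_result))
```

Because the operations are stored rather than the tables, the same sequence can be replayed whenever the source changes, and no intermediate table needs to be kept in memory between runs.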

Displaying dataflows for all active tables

To display the dataflows for all active tables in a worksheet, click in the lower right corner of the Xcalar Design window. For more information about each element in the dataflow graph, see Dataflow graph.

Creating a batch dataflow

Follow these steps to create a batch dataflow:

  1. Click to view the dataflows for all tables in the worksheet.
  2. Locate the dataflow for the active table. This is the dataflow from which you create the batch dataflow.
  3. Click in the upper right corner of the dataflow graph to display the DATAFLOW panel.
  4. In the panel, type the name of the dataflow.

  5. Select the columns to be included in the batch dataflow result. If the batch dataflow does not include a Join operation, skip step 6.
  6. If the dataflow result contains columns that have duplicate names, you are prompted to rename the columns.

    NOTE: All columns created by a batch dataflow are derived columns. If a column before the Join operation has a prefix in the name, the prefix is removed when the batch dataflow creates the table. Therefore, if the left table in the Join operation contains a column named airlines1::Carrier and the right table contains a column named carrier::Carrier, the column names are considered duplicates.

    The following partial screenshot of the DATAFLOW panel illustrates how to rename a duplicate column name when creating a batch dataflow. You can either type a new column name or accept the suggested new column name. The column name does not include a prefix.

  7. Click CREATE. The dataflow result, whether it is a file or table, contains only the selected columns.
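The duplicate-name rule described in the note for step 6 can be sketched as follows. This is an illustrative approximation, not Xcalar's implementation: the helper names, the `FlightNum` column, and the numeric-suffix renaming scheme are assumptions. The only behavior taken from the text is that the prefix before `::` is dropped and that the resulting names must be unique.

```python
from collections import Counter
from typing import Dict, List


def strip_prefix(column: str) -> str:
    """Return the name after '::', or the name unchanged if it has no prefix."""
    return column.split("::", 1)[-1]


def find_duplicates(columns: List[str]) -> List[str]:
    """List the columns whose prefix-free names collide with another column."""
    counts = Counter(strip_prefix(c) for c in columns)
    return [c for c in columns if counts[strip_prefix(c)] > 1]


def suggest_names(columns: List[str]) -> Dict[str, str]:
    """Map each column to a unique prefix-free name.

    Appending 1, 2, ... to resolve a clash is an assumption of this sketch.
    """
    used = set()
    suggestions = {}
    for column in columns:
        base = strip_prefix(column)
        candidate = base
        suffix = 1
        while candidate in used:
            candidate = f"{base}{suffix}"
            suffix += 1
        used.add(candidate)
        suggestions[column] = candidate
    return suggestions


# Columns from the left and right tables of a Join; FlightNum is hypothetical.
columns = ["airlines1::Carrier", "carrier::Carrier", "airlines1::FlightNum"]
duplicates = find_duplicates(columns)
renamed = suggest_names(columns)
print(duplicates)
print(renamed)
```

In this sketch, airlines1::Carrier and carrier::Carrier both reduce to Carrier, so they are reported as duplicates and one of them receives a different prefix-free name.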
