The dataflow graph provides an audit trail for your data. It helps you trace data lineage through all stages of your analytics pipeline. Over time, the data in your table might become increasingly complex due to the operations performed on the table. You can use the dataflow graph to determine how the data has been transformed since the table was created. For example, you can determine when a value in a column was filtered out or what operation was used to create a particular column.
Overview of the dataflow graph
The dataflow graph panel displays the dataflow graphs for all active tables in the same worksheet. The following screen shows an example of a dataflow graph.
Example of a dataflow
In the following example, a series of operations takes place on a table created from a dataset with airline data:
- A table named airlines4#XS10 is created.
- A Map function is executed to change the data type of the DayOfWeek column to Integer. In this example, the user locks the resultant table to prevent it from being dropped accidentally.
- A Map function is executed to change the data type of the DepTime column to Integer, but only erroneous rows are included in the resultant table. For more information about erroneous rows, see Creating a table with only erroneous rows.
- The table named airlines4#XS10 is dropped. This table is removed permanently. You cannot add it back to your worksheet.
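The third step above, a Map that keeps only erroneous rows, can be pictured with a small Python sketch. This is a conceptual illustration, not Xcalar code: the function names and sample rows are made up, and it assumes that a row is erroneous when the Map function fails to produce a value for it.

```python
# Conceptual sketch only; this is not Xcalar code.

def to_int(value):
    """Hypothetical Map function: cast a field to Integer, or raise ValueError."""
    return int(value)


def map_column(rows, column, func):
    """Apply func to one column and split the rows by success or failure."""
    converted, erroneous = [], []
    for row in rows:
        try:
            converted.append({**row, column: func(row[column])})
        except (TypeError, ValueError):
            # Assumption: a row is erroneous when the Map function fails on it.
            erroneous.append(row)
    return converted, erroneous


# Made-up sample rows using the column names from the example above.
rows = [
    {"DayOfWeek": "1", "DepTime": "1455"},
    {"DayOfWeek": "2", "DepTime": "NA"},
    {"DayOfWeek": "3", "DepTime": "730"},
]

converted, erroneous = map_column(rows, "DepTime", to_int)
print(erroneous)  # only the row whose DepTime could not be cast
```

In this sketch, the resultant table in the third step corresponds to `erroneous`, while a regular Map would produce `converted`.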
Collapsing and expanding table icons in a dataflow graph
Xcalar Design automatically collapses multiple table icons in a dataflow graph for better readability. The following screenshot shows the icon that indicates some table icons are collapsed in the dataflow graph. You can hover over the icon to see the number of collapsed table icons. Click the icon to expand the table icons.
The following partial screenshot shows the icon that you can click to collapse a list of tables.
Displaying dataset information
To display detailed information about the dataset (for example, the path to the data source, the parameters and import UDF used when data is imported, and so on), click the dataset icon at the beginning of the dataflow graph.
You can click a table icon in the dataflow graph to display a pop-up menu that lists the actions you can take. The available actions depend on whether the selected table is a temporary table or an active table.
If a table is displayed as the last table in any dataflow graph, it is an active table. All other tables in the dataflow are temporary tables. For more information about different table statuses, see Understanding and changing table statuses.
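The rule above can be sketched in a few lines of Python. This is a simplified model, not how Xcalar Design stores its graphs: each dataflow graph is assumed to be an ordered list of table names ending with its last table, and every table name other than airlines4#XS10 is hypothetical.

```python
def classify_tables(dataflow_graphs):
    """Return a dict that maps each table name to "active" or "temporary".

    dataflow_graphs is a list of graphs; each graph is an ordered list of
    table names whose final element is the last table in that graph.
    """
    # A table is active if it is the last table in any dataflow graph.
    active = {graph[-1] for graph in dataflow_graphs if graph}
    status = {}
    for graph in dataflow_graphs:
        for table in graph:
            status[table] = "active" if table in active else "temporary"
    return status


# Hypothetical worksheet with two dataflow graphs.
graphs = [
    ["airlines4#XS10", "airlines4#XS11"],
    ["airlines4#XS10", "airlines4#XS11", "airlines4#XS12"],
]
status = classify_tables(graphs)
print(status)
```

Here airlines4#XS11 appears in the middle of the second graph but is still active, because it is the last table in the first graph.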
The following partial screenshot shows the options available when you click a table icon in a dataflow graph.
You can perform these tasks only on temporary tables:
- Adding a table to the worksheet: You can click a temporary table icon in the dataflow graph and add the table to the worksheet. After you add the table to the worksheet, the table becomes an active table and has its own dataflow graph.
  NOTE: After you add a temporary table back to a worksheet, the drop-down menu invoked from its icon no longer contains the Add table to worksheet option. Instead, the menu provides the Find table in worksheet option.
- Reverting to the selected table: If you want to restore the table to a particular state, click a table icon in the dataflow graph and revert to it. This causes the operation icons and the table icons to the right of the selected table icon to disappear.
  NOTE: Tables removed from the dataflow graph due to a revert are not dropped. They are categorized as Temporary in the Tables list. Temporary tables can be added back to the worksheet.
You can perform these tasks on active tables only:
- Finding the table in the worksheet: You can quickly locate a particular table in a worksheet by clicking its icon in a dataflow graph. Keep the dataflow graph panel at less than its maximum size so that the table you are trying to locate remains visible in the worksheet.
- Hiding the table: You can hide a table that you do not use often. The hidden table does not show up in the worksheet, but can be added back at any time. Also, the dataflow graph of a hidden table is removed from the dataflow graph panel.
You can perform these tasks on temporary and active tables:
- Dropping a table: You can drop a table if the table is no longer needed. The result of dropping a table depends on whether the table is temporary or active:
  - If you drop a temporary table, its icon in the dataflow graph is changed to gray. If the table is a part of other dataflow graphs, its icons in those dataflow graphs become gray as well. You cannot perform any operations on a dropped table.
  - If you drop an active table, the table and its dataflow graph are permanently removed from Xcalar. However, the temporary tables represented by icons in the dataflow graph are unaffected by your dropping the active table. These temporary tables continue to exist.
  IMPORTANT: You cannot undo the action of dropping a table.
  NOTE: You cannot drop a table if an operation is taking place on that table.
- Locking a table: To prevent a table from being dropped accidentally, you can lock it. A lock icon is displayed on the table icon to show that the table is locked. If the table exists in multiple dataflow graphs or multiple worksheets, the lock is applied to all occurrences of the table icon.
- Show schema: A pop-up shows the number of rows in the table, in addition to the fields in the table and their data types. Clicking a field in this pop-up enables you to trace its lineage in the following ways:
  - If you click a field name, one or more table and dataset icons are highlighted. The highlighted icons represent all tables and datasets that contain the fields from which the field evolves.
  - If the field clicked in the Show schema pop-up is not created from other fields, the highlighted icons enable you to locate the first table that includes the field.
  NOTE: You can trace the lineage of a field even if its lineage includes a table that has been dropped.
  To remove the highlight, click a highlighted table icon.
  EXAMPLE: After a Join operation, it might not be obvious where a particular field in the resultant table comes from because multiple tables (or datasets) are involved in creating the resultant field. By clicking the field in the Show schema pop-up, you can determine which tables and datasets contain the fields used in the operations that result in that field.
  EXAMPLE: Suppose you use an extension to generate row numbers in a new column. After you perform many operations, you might want to find the table where this column first occurs. Use Show schema from a table icon with this column. All tables containing this column are highlighted, which makes it easy to locate the table where the column is created.
- Create a table with erroneous rows only: You can use the same function to create a version of this table, except that the resultant table contains only erroneous rows. This option is available if the selected table is created by a Map or Group By function. For more information about tables with erroneous rows only, see Creating a table with only erroneous rows.
- Create a complement table: If a table is the result of a Filter operation, you can create its complement, that is, a table containing the rows that the filter excluded.
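The lineage tracing described under Show schema can be thought of as a walk backward through the dataflow graph. The following Python sketch illustrates the idea under stated assumptions: the `parents` mapping, the function name, and all table, dataset, and field names are hypothetical, and this is not how Xcalar Design implements the feature.

```python
def trace_lineage(parents, table, field):
    """Return the names of the tables and datasets to highlight.

    parents maps (table, field) to the list of (source, field) pairs that
    the field was created from. A pair with no entry has no earlier source.
    """
    highlighted = set()
    seen = set()
    stack = [(table, field)]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        highlighted.add(node[0])  # highlight the table or dataset itself
        stack.extend(parents.get(node, []))
    return highlighted


# Hypothetical lineage after a Join of two tables, each created from a dataset.
parents = {
    ("joined", "CarrierName"): [("carriers", "CarrierName")],
    ("carriers", "CarrierName"): [("carriers_ds", "CarrierName")],
    ("joined", "DepTime"): [("flights", "DepTime")],
    ("flights", "DepTime"): [("flights_ds", "DepTime")],
}

print(sorted(trace_lineage(parents, "joined", "CarrierName")))
```

In this sketch, tracing CarrierName from the joined table highlights only the carriers side of the Join, which mirrors the Join example above: the highlight shows which source tables and datasets contributed the field.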
Saving or displaying a dataflow graph as an image file
If you want to save a dataflow graph as an image (for example, to include the image in a presentation or to share the image with others), click the icon for saving the graph as an image. The image is saved in PNG format.
Alternatively, you can click the icon for displaying the dataflow graph as a PNG image in a new tab of your browser.