Creating a table with only erroneous rows
Erroneous table rows are rows in which a function fails to produce a valid value. These rows might be of interest because you can gain insights from erroneous data (for example, by finding unexpected values in an analysis of data for fraud detection) or because you want to cleanse data before performing further data transformation. Creating a table that contains only erroneous rows can help you locate some integrity constraint violations in your data source.
Examples of erroneous rows
The following list describes the scenarios where Xcalar Design detects invalid values and creates a table with erroneous rows only:
- Suppose a syntax error occurs when the values are inserted or updated in the data source before Xcalar Design imports the data. You can create a table with erroneous rows only to detect the violations. For example, if the Age field contains a string such as 5O (instead of 50) due to a data entry error, when you run a function to cast this field to Integer, the cell with 5O is shown as FNF (field not found) in the resultant table. If you create an erroneous-rows-only table when running the function, the resultant table provides an explanation for why the cell with 5O cannot be converted to an integer.
Suppose a column contains FNF in some cells. You can run a function to create a table with erroneous rows only to identify where FNF occurs.
- If you use the div function and the divisor is 0, the resultant rows are considered erroneous rows.
In each erroneous row, a message is provided to explain the error. The message helps you determine the type of integrity constraint violation in the source data. The following list shows some examples of the messages:
- Cast operation failed
- Some variables are undefined during evaluation
- Divide by zero error
Functions for creating a table with erroneous rows only
The Map and Group By functions enable you to create a table with only erroneous rows. This means that when you use the Map or Group By function to create a table, you can specify that the table includes only rows in which the function cannot produce a valid value.
This section assumes that you understand how to use the Map and Group By functions in Xcalar Design. If not, familiarize yourself with these functions by reading the following topics:
Example of a Map function for creating a table with erroneous values
In the table created from a dataset with flight information, the column named ArrDelay contains the number of minutes for each flight's arrival delay. Suppose the initial data type for this column is String, you can use the type-casting function in the MAP panel to change its type to Integer. If you select Include only erroneous rows in the MAP panel, the resultant table contains only the rows where the function fails to produce a valid value.
The following screenshot illustrates the MAP panel for changing the data type of a column. In this example, row 20 is where the type casting function encounters an error. The other rows in this screenshot will not be present in the resultant table.
The following screenshot illustrates the resultant table with only erroneous values in the ArrTime column.
Re-running a function in a dataflow to create a table with erroneous rows only
This section assumes that you are familiar with the dataflow graph in a worksheet. For information about dataflows, see Dataflow graph.
In a dataflow graph, you can create a table with erroneous rows only through a Map or Group by function previously executed. For example, in the following dataflow graph, you can click the icon for the active table (last icon in the dataflow) to display a pop-up menu. In the menu, you can select Create erroneous rows table to create a table with the same Map function shown in the dataflow. This method has the same result as selecting Include only erroneous rows in the MAP panel.
The following list describes the results of creating a table with erroneous rows from the dataflow graph:
- A new table is created by the Map or Group By function. The new table contains only rows with errors. The string _er is appended to the table name. For example, if the table in which you execute the function is named airlines, the table with erroneous rows is named airlines_er.
- The new table is an active table with its own dataflow graph. The table icon in the graph is red to indicate that the table contains erroneous rows only.
- The dataflow graph from which the new table is created stays in the worksheet and remains unchanged.
The following screenshot is an example of a dataflow graph with an erroneous rows table.