Resolving Integrity Constraint Violations with Xcalar Data Platform
An Integrity Constraint Violation (ICV) is a database term that describes a situation where a given set of fields in a table do not comply with the table’s schema. ICVs can occur for a number of reasons — incorrect data type casting, dirty source data, errors in runtime data transformation processes, upstream sources changing data schemas in a way that breaks backward compatibility — and identifying and fixing them typically requires a great deal of time investment on the part of DBAs and data analysts. But this time investment can be dramatically decreased by using Xcalar Data Platform (XDP), which detects and notifies the user of ICVs when importing and processing data.
ICV Detection While Importing Data
The Xcalar platform operates directly off the data lake, which could potentially be schema-less. But rather than impose a schema on read, the platform’s True Data in Place architecture maintains the data’s original format. It eschews lossy importation methods such as flattening a record or extracting fields by definition, thereby avoiding losing information that could be critical for discovering meaningful insights. Flatting a JSON file, for instance, means losing the order of the elements, their hierarchy, and their relative structure. Xcalar instead offers the ability to either dynamically impose auto-detected schemas or accept a user-defined schema, thereby offering infinite flexibility for doing discovery.
In this context, an ICV can occur when schema detection is not entirely feasible. To protect against this possibility, XDP detects and alerts the user of such a scenario, as shown below.
When a user selects a data source to import, Xcalar Design displays both a preview of the data and a list of options. And underneath “Advanced Options” is the error tolerance option, which determines how XDP reacts to errors during the final importing process.
By default, error tolerance is set to low. This causes the import to automatically fail on error.
To demonstrate, a single double quote has been inserted into this 22,000 row CSV file. Without an accompanying closing double quote, this results in the rest of the file being treated as a single, enormous string. The imported table will only have 6 rows, which is obviously incorrect.
With error tolerance set to low, Xcalar aborts the import, then displays the error message seen above that lists the suspected origin of the error. The user is thus immediately aware both that there is a problem and where to fix it.
ICV Detection While Processing Data
XDP was built to execute sophisticated business logic, with the expectation that it would be processing data through complex dataflows to transform it into useful business insights. As dataflows are expected to run frequently, if not constantly in the case of real-time analysis, Xcalar offers late binding resolution of any errors that occur during data operations. Instead of halting operations until an ICV is resolved, Xcalar flags all fields that break schema integrity constraints, replaces them with “FNF” (Field Not Found), and then allows the flagged fields to continue through the dataflow. This provides users with a full visualization of the data, allowing appropriate corrective actions to be taken.
To demonstrate, consider how a type-casting operation can potentially trigger an ICV.
Observe this sample CSV file. The majority of the values in the Inventory column are numerical, but as the arrows indicate there are also two string values.
When importing a table from an unstructured data format like CSV or JSON, XDP dynamically detects the most appropriate schema. In this case, it classifies the values in Inventory as strings to preserve the “N/A” fields. But this decision makes mathematical operations on Inventory difficult, so the user may decide to convert the values in Inventory into integers later with Cast nodes.
But “N/A” is not a number, so it can not be properly converted. As such, Xcalar automatically replaces it with a FNF.
Aside from losing their original value, FNFs are treated like any other valid field. They can be profiled, filtered out, or left untouched without damaging the overall dataflow. And in the discovery phase of the big data analytic process, they can serve as an indicator of potentially interesting data. As the saying goes, you can learn from your mistakes.
When an enterprise is pinning its future on data-driven decision making, it is essential that the data it uses be accurate. But that data is often stored in a multitude of formats, and combining these various datasets requires joining ill-defined data fields. That is why it is so crucial that XDP alerts users to potential ICVs at every possible stage, as that way schema violations can be immediately identified and resolved. But we also understand that immediate error resolution is not always feasible. Hence why when processing data, XDP only flags erroneous data while still producing the requested data transformation with the remainder of the data.
While ICV detection tends to be an overlooked topic when the conversation turns to data platforms, it still plays a vital role for the user in the data transformation and analytics process. That is why the Xcalar platform feature comprehensively implemented safeguards against ICVs built into its design.