|Creation date||Minimum version|
This article introduces some basic concepts and techniques for handling XML with Xcalar.
XML is a flexible form of data representation with a few important characteristics:
- All data is represented as strings.
- Relies on a hierarchical organization and the nesting of elements.
- Inflates the size of raw data by surrounding content with element tags and other metadata.
- Easily becomes malformed, particularly if not validated, which creates processing difficulties.
- XML can contain markup metadata; markup metadata describes how the data is to be displayed.
- Element tags are not predefined.
While Xcalar can natively import XML data sources as-is, it is often useful to create and apply an import UDF (User Defined Function) to help distill and organize the data during dataset creation. This benefits all users with organized and meaningful data in the dataset.
Understanding XML Elements and how they are imported into Xcalar
This section shows how an XML file (CD_catalog.xml) is imported into Xcalar without a UDF.
The following XML example contains a section from a music catalog that describe albums for sale. It does not reflect any physical formatting to represent the hierarchical structure.
Example 1: CD_catalog.xml XML data file without formatting
In example 2 tabular formatting is introduced, which emphasises the XML element hierarchy. The blue characters represent the XML element tags and the red characters represent the payload data content.
Example 2: CD_catalog.xml XML data file with tabular formatting
A dataset was created natively by Xcalar from the raw CD_catalog XML file , resulting in the following:
- The XML data is intact and fully available for modeling.
- The resultant table contains a single column named Column0, which is a string datatype.
- The hierarchical element order is not preserved, which results in lost hierarchical relationships.
- The nesting level information is not captured. The cells containing opening and closing element tags of nested elements contain no payload data content values.
Figure 1: Resultant CD_catalog Dataset Table with no UDF
Creating an XML UDF
This section describes how a python UDF helps distill and organize the raw XML source data during dataset creation.
The simpleParse UDF was run during the CD_catalog dataset creation, resulting in the following:
- The resultant table contains several columns, which are not all string datatypes.
- The order is not preserved, but is captured in a separate column called lineNum. The order can be restored by running a sort operation on the lineNum column.
- The nesting level information is not preserved, but is restorable by running a sort operation.
- The XML tags and payload data have been separated into their own columns.
- Several rows report errors, which require understanding and correction.
Figure 2: Resultant CD_catalog Dataset Table with UDF
Though a sort operation is required on the linNum column, it is important to first understand why there are errors and to fix them. The chosen python library reports an error for rows that contain tags without elements
To understand the error issue, a filter is run on the error column. As shown in Figure 3.
Figure 3 Applying a filter in Xcalar Design to show only errors rows
The resultant table displays only those rows that contain errors, as shown in Figure 4. Widening the error column displays the full error message.
Figure 4: Resultant CD_catalog Dataset Table after running a Filter Operation
The corresponding error rows are identified in the raw data by red arrows in Figure 5.
Figure 5: Empty Opening and Closing Element Tags of Nested Elements
This issue can be handled in the simpleParse UDF with an exception handler.
Now that the issue is identified a sort operation is run on the lineNum column resulting in the following:
- The order is now visible and correct.
- The hierarchy is not directly visible, but can be derived with a start-tag and end-tag sequence.
- The error rows will be handled in the UDF by an exception handler.
- The payload-tag indicates a column name if the data is transposed into rows.
- The payload-data contains a column value if the data is transposed into rows.
Figure 6: Resultant CD_catalog Dataset Table after running a Sort Operation
Enhancing the simpleParse UDF
This section shows how you can enhance the simpleParse UDF for larger sets of XML data.
The simpleParse UDF is enhanced to transform the raw data into rows containing these columns:
Figures 7 and 8 show the resultant tables after adding enhancements to the simpleParse UDF code, where in Figure 7 the UDF is applied on a larger set of XML data.
Figure 7: Resultant CD_catalog Dataset Table after enhancement 1
Figure 8: Resultant CD_catalog Dataset Table after enhancement 2
This dataset is now ready for analysis, blending, joining, reporting, and whatever else you may wish to use it for.
simpleParse UDF script
The simpleParse UDF leverages a popular open-source python library. For more information, see https://github.com/martinblech/xmltodict.
In order to achieve the resultant tables shown in this article, a single Xcalar import UDF module was created. The UDF module contains two functions:
The examples presented here demonstrate one method for XML parsing. This example does not cover complex scenarios, such as:
- malformed xml
- characterset issues
- hidden unprintable characters
- excessive length of xml entries/files