Creation date Minimum version
2017-8-25 1.2.1

Introduction to XML Parsing with Xcalar

This article introduces some basic concepts and techniques for handling XML with Xcalar.

Overview

XML is a flexible form of data representation with a few important characteristics:

  • All data is represented as strings.
  • Relies on a hierarchical organization and the nesting of elements.
  • Inflates the size of raw data by surrounding content with element tags and other metadata.
  • Easily becomes malformed, particularly if not validated, which creates processing difficulties.
  • XML can contain markup metadata; markup metadata describes how the data is to be displayed.
  • Element tags are not predefined.

While Xcalar can natively import XML data sources as-is, it is often useful to create and apply an import UDF (User Defined Function) to help distill and organize the data during dataset creation. This benefits all users with organized and meaningful data in the dataset.

Understanding XML Elements and how they are imported into Xcalar

This section shows how an XML file (CD_catalog.xml) is imported into Xcalar without a UDF.

The following XML example contains a section from a music catalog that describe albums for sale. It does not reflect any physical formatting to represent the hierarchical structure.

Example 1: CD_catalog.xml XML data file without formatting

<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD>
<TITLE>Hide your heart</TITLE>
<ARTIST>Bonnie Tyler</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>CBS Records</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1988</YEAR>
</CD>
</CATALOG>

In example 2 tabular formatting is introduced, which emphasises the XML element hierarchy. The blue characters represent the XML element tags and the red characters represent the payload data content.

Example 2: CD_catalog.xml XML data file with tabular formatting

<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD>
<TITLE>Hide your heart</TITLE>
<ARTIST>Bonnie Tyler</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>CBS Records</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1988</YEAR>
</CD>
</CATALOG>

A dataset was created natively by Xcalar from the raw CD_catalog XML file , resulting in the following:

  • The XML data is intact and fully available for modeling.
  • The resultant table contains a single column named Column0, which is a string datatype.
  • The hierarchical element order is not preserved, which results in lost hierarchical relationships.
  • The nesting level information is not captured. The cells containing opening and closing element tags of nested elements contain no payload data content values.

Figure 1: Resultant CD_catalog Dataset Table with no UDF

Creating an XML UDF

This section describes how a python UDF helps distill and organize the raw XML source data during dataset creation.

The simpleParse UDF was run during the CD_catalog dataset creation, resulting in the following:

  • The resultant table contains several columns, which are not all string datatypes.
  • The order is not preserved, but is captured in a separate column called lineNum. The order can be restored by running a sort operation on the lineNum column.
  • The nesting level information is not preserved, but is restorable by running a sort operation.
  • The XML tags and payload data have been separated into their own columns.
  • Several rows report errors, which require understanding and correction.

Figure 2: Resultant CD_catalog Dataset Table with UDF

Though a sort operation is required on the linNum column, it is important to first understand why there are errors and to fix them. The chosen python library reports an error for rows that contain tags without elements

To understand the error issue, a filter is run on the error column. As shown in Figure 3.

Figure 3 Applying a filter in Xcalar Design to show only errors rows

The resultant table displays only those rows that contain errors, as shown in Figure 4. Widening the error column displays the full error message.

Figure 4: Resultant CD_catalog Dataset Table after running a Filter Operation

The corresponding error rows are identified in the raw data by red arrows in Figure 5.

Figure 5: Empty Opening and Closing Element Tags of Nested Elements

This issue can be handled in the simpleParse UDF with an exception handler.

Now that the issue is identified a sort operation is run on the lineNum column resulting in the following:

  • The order is now visible and correct.
  • The hierarchy is not directly visible, but can be derived with a start-tag and end-tag sequence.
  • The error rows will be handled in the UDF by an exception handler.
  • The payload-tag indicates a column name if the data is transposed into rows.
  • The payload-data contains a column value if the data is transposed into rows.

Figure 6: Resultant CD_catalog Dataset Table after running a Sort Operation

Enhancing the simpleParse UDF

This section shows how you can enhance the simpleParse UDF for larger sets of XML data.

The simpleParse UDF is enhanced to transform the raw data into rows containing these columns:

  • Title
  • Artist
  • Country
  • Company
  • Price
  • Year

Figures 7 and 8 show the resultant tables after adding enhancements to the simpleParse UDF code, where in Figure 7 the UDF is applied on a larger set of XML data.

Figure 7: Resultant CD_catalog Dataset Table after enhancement 1

Figure 8: Resultant CD_catalog Dataset Table after enhancement 2

This dataset is now ready for analysis, blending, joining, reporting, and whatever else you may wish to use it for.

simpleParse UDF script

The simpleParse UDF leverages a popular open-source python library. For more information, see https://github.com/martinblech/xmltodict.

In order to achieve the resultant tables shown in this article, a single Xcalar import UDF module was created. The UDF module contains two functions:

  • simpleParse
  • constructRow

Example 3:

# xml_parsing_demo
# simpleParse will just feed back the XML data as it is read in, along with some additional metadata to help understand the input stream.
# It will not attempt to assemble multiple tags into a record
# constructRow will accumulate multiple elements into a single row, then yield the row when the closing </CD> tag is encountered
# this is intended to be used with s3://xcfield/users/allan/cd_catalog.xml as data source
# this requires xmltodict to be installed on each node of the Xcalar cluster
import sys
import xmltodict
def simpleParse(inpath, inStream):
lineNum=0
for line in inStream:
lineNum += 1
#
try:
inRec = xmltodict.parse(line)
for field in inRec:
yield {"lineNum": lineNum, "payload-tag": field, "payload-data": inRec[field], "inputLine": line}
except:
e = sys.exc_info()
yield {"lineNum": lineNum, "error": "Error {}".format(e), "inputLine": line}
 
def constructRow(inpath, inStream):
lineNum=0
accumRow = {"source-file": inpath}
for line in inStream:
lineNum += 1
lineLen = len(line)
#
try:
inRec = xmltodict.parse(line)
for field in inRec:
#
if field == "TITLE":
accumRow["Title"]= inRec[field]
elif field == "ARTIST":
accumRow["Artist"]= inRec[field]
elif field == "COUNTRY":
accumRow["Country"]= inRec[field]
elif field == "COMPANY":
accumRow["Company"]= inRec[field]
elif field == "PRICE":
accumRow["Price"]= inRec[field]
elif field == "YEAR":
accumRow["Year"]= inRec[field]
#the closing </CD> will be caught by the exception handler because it has no associated field
except:
e = sys.exc_info()
#
if "</CD>" in line:
yield accumRow
accumRow = {"source-file": inpath} # reset the accumulator inRec
else:
# enable the following yield line if you wish to see the errors from the lines that have only tags, no elements
# yield {"lineNum": lineNum, "lineLen": lineLen, "field": "parseError", "error": "Error {}".format(e), "inputLine": "|" + line + "|"}
pass
# bottom

Final notes:

The examples presented here demonstrate one method for XML parsing. This example does not cover complex scenarios, such as:

  • arrays
  • malformed xml
  • characterset issues
  • hidden unprintable characters
  • excessive length of xml entries/files

Go to top