Preparing the data source

To use Xcalar reliably and predictably, you must make sure that your data source meets the requirements described in this topic. You might need to prepare your data files so that Xcalar Design can properly extract the metadata from them and can present the data in a readable format appropriate for data modeling.

Importing multiple files

The data source can be a file or a directory. If you import a directory, Xcalar Design imports the data from every file in the directory unless you specify a search pattern. You can also import subdirectories recursively.

If you want to create a single dataset from multiple files, place the files in one directory.

For example, suppose you want to import airline data gathered in 2015 and 2016, which is stored in two files. To create a dataset with records from both files, use one of these approaches:

  • Create a directory named AirlinesData and place both files in the directory. Then import the AirlinesData directory.
  • Give the files names that share a common pattern, such as airlines2015.csv and airlines2016.csv. Then instruct Xcalar Design to import only these files by specifying a search pattern. Only files with names matching the pattern are processed.

    You can use an asterisk (*) as a wild card character in the search pattern. In this example, you can specify the following pattern:

    airlines201*.csv

    Alternatively, use a regular expression (regex) pattern. In this example, you can specify the following regex pattern:

    airlines201.\.csv

    Renaming the source files in this way lets you create one dataset with records from only airlines2015.csv and airlines2016.csv. If other files such as Airlines2000.csv or airlines2015.xls exist in the same directory, Xcalar Design disregards them because their names do not match the specified pattern.
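Before you import, you can preview which file names a pattern selects. The following is a minimal sketch that uses Python's standard fnmatch and re modules. It assumes case-sensitive matching against the whole file name, which might differ from how Xcalar Design evaluates patterns. The helper names match_wildcard and match_regex are hypothetical.

```python
import fnmatch
import re


def match_wildcard(names, pattern):
    """Return the names that match a wildcard pattern, case-sensitively."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def match_regex(names, pattern):
    """Return the names that the regex matches in full (assumed semantics)."""
    rx = re.compile(pattern)
    return [n for n in names if rx.fullmatch(n)]


files = [
    "airlines2015.csv",
    "airlines2016.csv",
    "Airlines2000.csv",
    "airlines2015.xls",
]

# Both patterns from the example select only the 2015 and 2016 CSV files.
wildcard_hits = match_wildcard(files, "airlines201*.csv")
regex_hits = match_regex(files, r"airlines201.\.csv")
```

Under these assumptions, the two patterns are not equivalent. The asterisk matches any number of characters, so a name such as airlines2015_backup.csv matches the wildcard pattern, whereas the regex accepts exactly one character after airlines201.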

Verifying the file format or schema

If you import a directory or multiple files, make sure that the source files meet the following requirements to avoid unpredictable operation results:

  • All files are in the same format. For example, they must all be JSON files or all be XML files.
  • If the file format is CSV, the files must have the same schema.

To instruct Xcalar Design to import only the files that meet these requirements, use file name pattern matching to limit the files to be processed.
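One way to check the CSV requirement before you import is to compare the first row of each file. The following sketch groups the CSV files in a directory by their header row. It assumes comma-delimited, UTF-8 encoded files whose first row holds the column names. The function names read_header and group_by_header are hypothetical.

```python
import csv
from pathlib import Path


def read_header(path, delimiter=","):
    """Return the first row of a CSV file as a list (empty if the file is empty)."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f, delimiter=delimiter), [])


def group_by_header(directory, pattern="*.csv"):
    """Map each distinct header row to the sorted names of the files that use it."""
    groups = {}
    for path in sorted(Path(directory).glob(pattern)):
        groups.setdefault(tuple(read_header(path)), []).append(path.name)
    return groups
```

If group_by_header returns more than one key, the files do not share a schema. You can then move the odd files elsewhere or narrow the search pattern so that only one group is imported.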

Creating a first row as column headers

NOTE: The information about creating a first row as column headers is not applicable if the data to be imported is in JSON format.

Xcalar recommends that you use the values in the first row of your data file as column headers. If those values are not appropriate as column headers, create a new first row, which you can then promote to column headers. For information about naming conventions for column headers, see Preparing the first row as column header. (If you do not provide column headers, Xcalar Design uses column0, column1, and so on, as default column headers.)

IMPORTANT: If you import a directory, the first row created must be the same for each file in the directory. Xcalar Design processes data files in parallel; the order of the files being processed is not predetermined. If all the files contain the same first row, the desired column headers will be used when Xcalar Design promotes the first row.
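If several files need the same first row, you can add it with a small script before you import. The following sketch prepends a header line to a file unless the file already starts with that line. It assumes UTF-8 encoded text files that are small enough to read into memory. The function name prepend_header and the header value in the usage note are hypothetical.

```python
from pathlib import Path


def prepend_header(path, header_line, encoding="utf-8"):
    """Insert header_line as the first row unless it is already there.

    Returns True if the file was changed, False if it was left as-is.
    """
    path = Path(path)
    text = path.read_text(encoding=encoding)
    first_line = text.split("\n", 1)[0].rstrip("\r")
    if first_line == header_line:
        return False
    path.write_text(header_line + "\n" + text, encoding=encoding)
    return True
```

For example, calling prepend_header(p, "ID,Carrier") for every file p in the AirlinesData directory gives each file the same first row, so the result of promoting the first row does not depend on which file is processed first.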

Column headers improve the readability of the tables created by Xcalar Design. In addition, if you use an import UDF (a UDF that parses the data when Xcalar Design imports it), be aware that after the import, the table columns are ordered differently from the columns in the data source. If the columns do not have headers, you might not be able to determine the meaning of each column based on its position.

Example: Your data source is an Excel file with ID numbers in its first column. When you import it, Xcalar Design invokes a UDF to parse the data. If the file does not have headers, it might be difficult to determine which column is for ID numbers because the column is no longer the first column in the table displayed in Xcalar Design. If, before parsing, you create a header row as the first row of the Excel file, and assign a name to the first column (for example, ID), then you can easily locate the column in Xcalar Design.

Preparing the first row as column header

Before you promote the first row, make sure that the first row of the source file does not contain illegal characters. If there are illegal characters, edit the first row to remove the illegal characters. Alternatively, you can import the data source by using a user-defined function (UDF) provided by Xcalar to replace these characters.

In addition, make sure that the column headers based on the first row follow the Xcalar Design naming conventions. Otherwise, you might see unexpected operation results.

Determining the quoting character to use in CSV or text files

If your data source is a CSV or text file, check whether it contains a double quote mark (") that should be interpreted literally. By default, Xcalar Design treats the double quote mark as the character that surrounds a string value. If the quote mark in your data source should be parsed literally, you must override this default by specifying no character as the quoting character.
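The effect of the quoting character can be illustrated with Python's standard csv module. This is only an analogy: the Xcalar Design parser might handle edge cases differently. The helper parse_line and the sample lines are hypothetical.

```python
import csv
import io


def parse_line(line, quotechar='"'):
    """Parse one CSV line; pass quotechar=None to treat quote marks literally."""
    if quotechar is None:
        reader = csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(io.StringIO(line), quotechar=quotechar)
    return next(reader)


# Quote marks used as wrappers: the comma inside the quotes is data.
wrapped = '1,"Smith, John",NY'
wrapped_default = parse_line(wrapped)        # quotes are removed, one name field
wrapped_literal = parse_line(wrapped, None)  # quotes are kept, name is split

# Quote marks that are part of the data itself.
literal = '2,"A" grade,pass'
literal_default = parse_line(literal)        # quotes are consumed by the parser
literal_literal = parse_line(literal, None)  # quotes are preserved
```

With the default quoting character, the first line yields three fields and the second line loses its quote marks. With no quoting character, the second line is preserved exactly, but the first line is split into four fields. Which setting is right depends on how the quote marks are used in your data.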

Unicode encoding

Unicode is supported for delimiter-separated values and for JSON, provided that the data is UTF-8 encoded. UTF-16 and other encodings are not supported unless you use a user-defined function (UDF).
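If you prefer not to use a UDF, one possible preparation step is to convert the files to UTF-8 before you import them. This approach is an assumption, not a procedure documented for Xcalar Design. The following sketch uses only the Python standard library, and the function name convert_to_utf8 is hypothetical.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Copy a text file, re-encoding it from src_encoding to UTF-8.

    The "utf-16" codec uses the byte order mark to detect endianness.
    newline="" keeps the original line endings unchanged.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # Copy in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: src.read(64 * 1024), ""):
            dst.write(chunk)
```

The output is written without a byte order mark. Pass a different src_encoding value, such as "latin-1", if the source file uses another encoding.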

Next step

Import a data source by following the instructions in one of these topics:

Easy steps for importing data to Xcalar Design (with default settings)

Detailed steps for importing data to Xcalar Design (with custom settings)
