Detailed steps for importing data to Xcalar Design (with custom settings)

This section provides detailed information about the steps up to the point where you can start performing operations in tables presented in a Xcalar Design workbook. Read this section if you want to better understand the fields that are displayed when you start using Xcalar Design and if you want to override default settings when importing your data source.

IMPORTANT: You must follow the instructions in Preparing the data source before you start using Xcalar Design to import data.

Starting Xcalar Design

Follow these steps to start Xcalar Design:

  1. Using the HTTP or HTTPS protocol, point your browser to the location where Xcalar Design is installed. For example, enter the following URL:

    https://xcalar.mycompany.com/index.html

  2. Log in by using the login name and password given to you by the Xcalar administrator.

Before creating a workbook as your workspace

Before creating a workbook, decide on the workbook name. Workbook names can be changed at any time.

Creating a workbook as your workspace

Follow these steps to create a workbook:

  1. The Workbook Browser is automatically displayed after you log in to Xcalar Design for the first time. At other times, you can display or dismiss the Workbook Browser by clicking the icon in the upper left corner of the window. In the Workbook Browser, enter the workbook name and click CREATE WORKBOOK on the New Workbook card.

    After the workbook is created, it contains one worksheet named Sheet 1.

  2. The newly created workbook is inactive. To activate it, hover over the workbook card and click the first icon on the menu that appears. (An active workbook is the one you can work on, and you can have only one active workbook at a time.)

For more information about the Workbook Browser, see Workbook Browser.

Importing data from a data source

Follow these steps to import data:

  1. Click (Datasets icon). Its location is shown in the following screenshot:

  2. In the Import Data Source window, select the protocol for accessing the data source. The protocol can be HDFS, file, or S3. The following screen illustrates the Import Data Source window.

  3. Select the path to the data source as follows:

    1. If you select file:///, enter the complete path to a directory in the Data Source Path field as in the following example:

      /mydata/airlines

      Xcalar Design accepts a path with or without the preceding forward slash (/).

      If you select hdfs://, enter the host name followed by a slash (/) in the Data Source Path field. The slash enables you to browse the root directory of the specified host. For example, enter the following path for the root directory on host1:

      host1/

      Alternatively, to browse the contents of a directory other than the root directory, enter the complete path to that directory, as in the following example:

      host1/datasets

    2. Click BROWSE to display the Browse Data Source window, which lists the contents of the directory selected in the previous step. Navigate by double-clicking the directory icons until you see the desired source data file or directory. Click the file or directory icon to select it as your data source.

      TIP: If you have a large number of files, use the Regex Search field to filter out irrelevant files. For example, typing the regular expression us(t|a) filters out all files except these two: usability and justin.

      Optionally, you can right-click a file icon in this window and then click Preview to preview the raw data before selecting the file. You can preview the file in ASCII format or as a hex dump.

      NOTE: No previewer is available for directories.

      The following screenshots explain how to preview a file named airlines.csv.

  4. Click NEXT. The top portion of the Import Data Source window displays a preview of the source data in a tabular format.

    If the data source is a directory, the records in the preview are from the first file in the directory.

    You can position your mouse on a column header and drag it to resize each column. You can also use your mouse to scroll the rows.

  5. (Optional) If you import a directory and you want to preview records from other files in the directory, follow these steps:

    1. Click Change to display the Select File to Preview modal window.
    2. Select the file to preview.
    3. Click PREVIEW.
  6. (Optional) Specify values for the advanced options as described in About advanced options. The following screenshot shows the location of the options. In most cases, you can skip the advanced options and go on to the next step.

  7. In the Dataset Name field, accept the suggested name or type the desired name. The name cannot be changed later. The dataset enables you to create tables from the data source. For more information about datasets, see Managing datasets.

  8. If you want Xcalar Design to invoke an import UDF when parsing the data in the data source, select Parse Data With UDF and then specify the module name and function name.

    You can preview the results of the import UDF by clicking refresh preview. If you see unexpected results, you can select another UDF or no UDFs.

    For more information about UDFs, see Understanding UDFs and Using UDFs. If no UDFs are needed for parsing the data, skip to the next step.

    IMPORTANT: Parsing a data source with an import UDF is required if your data source contains illegal characters. For a list of these characters, see Preparing the first row as column header.
  9. The fields in the rest of the Import Data Source window tell Xcalar Design about the data format and layout. Xcalar Design automatically detects the format and layout of the data source, but you can override the default values.

    TIP: After you enter values in these fields, if you want to see the values recommended by Xcalar Design again, click REDETECT FORMAT.

    Depending on the format of the data source, some fields might not be available. For example, for the Excel format, only Promote First Row as Header is available. The following table describes the fields:

    Field Description
    Format

    The format of the data source. It can be separated values, JSON, Excel, or Text.

    NOTE: If the data source is an Excel file, only the data in the first worksheet of the Excel workbook is imported.
    Promote First Row as Header

    Depending on the data format, Xcalar Design automatically selects or deselects this option. For example, if the data source is a CSV file, Xcalar Design recommends that you promote the first row as header.

    When importing multiple files with the same first row, Xcalar Design promotes the first row of the first file being processed and omits the first row of subsequent files.

    If you use the Skip Rows field to skip rows that do not require analysis, Xcalar Design promotes the first row after the row-skipping. 

    EXAMPLE: A file contains comments in the first row, and the strings Name and Salary in cells on the second row. You can specify 1 in the Skip Rows field to eliminate the first row and promote the row containing Name and Salary as header.
    IMPORTANT: You can promote the first row only at this point. You will not be able to change the first row into the header row at a later time.
    Record Delimiter

    The UTF-8 character for separating records (for example, \n or a null character). It must be a single character, unless it is CRLF (\r\n). If your data uses a multiple-character delimiter, you must use an import UDF when importing the data.
    Field Delimiter

    The UTF-8 character or characters separating fields (for example, a tab or a comma). The delimiter can consist of multiple characters. The maximum length is 255 bytes.
    Quoting Character

    The character surrounding a string in the data source. The double quote mark (") is the default quoting character in Xcalar Design. If the data source uses another character to enclose a string, enter the character in this field. If your file does not use any character to denote a string, leave this field blank.
    Skip Rows

    The number of rows to omit at the beginning of each file when Xcalar Design imports the data source. Typically you omit rows that do not contain data requiring analysis (for example, rows containing comments). If you import a directory with multiple files, the same number of rows is skipped in each file.

    If you use a UDF when importing a data source, the UDF parses the data before the rows are skipped.

    NOTE: Xcalar Design automatically displays a new preview after you change the value of a field. The preview helps you determine whether the change creates the desired table format. If not, click REDETECT FORMAT to restore all default values.
  10. Click CREATE DATASET. The Dataset Preview window is displayed.

  11. In the Dataset Preview window, click Select All to select all the columns and then click CREATE TABLE. Xcalar Design creates a table with all the columns from the data source.

    NOTE: Under certain circumstances, not all the columns are included automatically in the Dataset Preview. For example, if a dataset contains 62 rows and the last 2 rows have a column that the first 60 rows do not have, this column is not included in the table in this step. You can, however, add this column later. For information about adding columns, see Adding columns from a dataset to a table.

You now have a dataset referencing your data source. The dataset is listed in the Datasets section of the left panel.
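
To make the interaction of the format fields in step 9 more concrete, the following Python sketch mimics how Field Delimiter, Quoting Character, Skip Rows, and Promote First Row as Header might combine when several files are imported. It is an illustration only, not Xcalar Design code: the function name parse_source and the sample file contents are hypothetical, and the sketch handles only a single-character field delimiter.

```python
import csv
import io


def parse_source(files, field_delim=",", quote_char='"', skip_rows=0,
                 promote_header=True):
    """Combine records from several files (hypothetical helper).

    files is a list of strings, each holding the full text of one file.
    Returns (header, records); header is None if no row is promoted.
    """
    header = None
    records = []
    for text in files:
        buf = io.StringIO(text)
        if quote_char:
            reader = csv.reader(buf, delimiter=field_delim, quotechar=quote_char)
        else:
            # A blank quoting character means no character denotes a string.
            reader = csv.reader(buf, delimiter=field_delim, quoting=csv.QUOTE_NONE)
        rows = list(reader)[skip_rows:]  # the same number of rows is skipped per file
        if promote_header and rows:
            if header is None:
                header = rows[0]         # the first file supplies the header
            rows = rows[1:]              # the header row is omitted from every file
        records.extend(rows)
    return header, records


# Two sample files: a comment in row 1, the column names in row 2.
file_a = '# exported on Monday\nName,Salary\n"Lee, Ann",100\n'
file_b = '# exported on Tuesday\nName,Salary\nBob,200\n'

header, records = parse_source([file_a, file_b], skip_rows=1)
print(header)
print(records)
```

With Skip Rows set to 1, the comment row of each file is dropped first, and the row containing Name and Salary is then promoted as the header, as in the EXAMPLE for Promote First Row as Header.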

About advanced options

Advanced options are provided when you select a data source. The options enable you to accomplish the following tasks:

  • import files with names matching the specified pattern
  • import files recursively
  • set the dataset size
NOTE: After using the advanced options, you can click refresh preview to see the effect of the option settings.

Pattern matching

To import files with names that match a particular string pattern, type the string with the wild card character in the Pattern field. The wild card character is an asterisk (*). For example, to import all files with names ending with csv, type *csv in this field.

Alternatively, you can enter a regular expression in the Pattern field and select the regex option. For example, you can enter the regular expression sellers(1|2|3) to import sellers1info.csv, sellers2data.csv, and sellers3.Info.csv, but not sellers4.csv.
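
The two kinds of pattern can be tried out in Python before you type them in the Pattern field. The sketch below is an approximation under stated assumptions: it uses Python's fnmatch module for the wildcard form and re.match (anchored at the start of the name) for the regex form, and the exact matching rules inside Xcalar Design may differ. The file name buyers.json is made up for contrast.

```python
import fnmatch
import re

names = [
    "sellers1info.csv",
    "sellers2data.csv",
    "sellers3.Info.csv",
    "sellers4.csv",
    "buyers.json",  # hypothetical extra file that should match neither pattern
]

# Wildcard form: the asterisk stands for any run of characters.
wildcard_hits = [n for n in names if fnmatch.fnmatchcase(n, "*csv")]

# Regex form: assumed here to be matched from the start of the file name.
regex_hits = [n for n in names if re.match(r"sellers(1|2|3)", n)]

print(wildcard_hits)
print(regex_hits)
```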

Recursive importing

If you import a directory, you can traverse all subdirectories to import all files in the subdirectories.

You can use pattern matching in conjunction with recursive importing to limit the import to files whose names match the specified pattern.

Example of recursive importing with pattern matching (not regex)

This example illustrates how to go through directories and find files with a matching pattern (without using regex). Suppose the data source directory selected is file:///flightData, and you want to create a dataset from the following files:

  • /flightData/airlines2015.csv
  • /flightData/airlines2016.csv
  • /flightData/currentYear/airlines2017.csv

In the Pattern field, specify airlines201*.csv, and then select Recursive. Xcalar Design traverses all subdirectories under /flightData to locate files whose name matches airlines201*.csv and then creates a dataset with records from those files.
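
The traversal described above can be sketched in Python as follows. This is a minimal illustration, not Xcalar Design's implementation: the helper find_matching is hypothetical, the directory tree is created in a temporary folder, and the files airports.csv and notes.txt are invented to show what is filtered out.

```python
import fnmatch
import os
import tempfile


def find_matching(root, pattern, recursive=True):
    """Return sorted paths (relative to root) of files whose names match pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatchcase(name, pattern):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                matches.append(rel.replace(os.sep, "/"))
        if not recursive:
            break  # os.walk visits the top directory first; stop after it
    return sorted(matches)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "currentYear"))
    for rel in ["airlines2015.csv", "airlines2016.csv",
                "currentYear/airlines2017.csv",
                "airports.csv", "currentYear/notes.txt"]:
        with open(os.path.join(root, *rel.split("/")), "w") as f:
            f.write("")
    recursive_hits = find_matching(root, "airlines201*.csv")
    top_level_hits = find_matching(root, "airlines201*.csv", recursive=False)

print(recursive_hits)
print(top_level_hits)
```

Without the recursive traversal, only the two files directly under the selected directory are found, so airlines2017.csv in currentYear would be missed.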

Example of recursive importing with pattern matching (regex)

This example illustrates how to go through directories to find files with a matching regex pattern. Suppose the data source directory selected is file:///flightData, and you want to create a dataset from the following files:

  • /flightData/airlines2015.csv
  • /flightData/airlines2016.csv
  • /flightData/currentYear/airlines2017.csv

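In the Pattern field, specify .*airlines201.\.csv and select the regex option. Then select Recursive. In this expression, the leading .* matches any directory prefix, the single period matches the last digit of the year, and \. matches the literal period before csv.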
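
To check such an expression before importing, you can test it against sample paths in Python. The sketch assumes that the expression is applied to the full path from its beginning, which is what the leading .* suggests; that is an assumption, not documented behavior. The last two paths are invented counterexamples.

```python
import re

PATTERN = re.compile(r".*airlines201.\.csv")

paths = [
    "/flightData/airlines2015.csv",
    "/flightData/airlines2016.csv",
    "/flightData/currentYear/airlines2017.csv",
    "/flightData/airlines20155.csv",         # two characters after 201: no match
    "/flightData/currentYear/airports.csv",  # different file name: no match
]

# re.match anchors the expression at the start of each path.
selected = [p for p in paths if PATTERN.match(p)]
print(selected)
```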

Dataset size

The Dataset Size field limits the size of the dataset created by Xcalar Design when it imports a data source. If importing a data source results in a dataset that exceeds the setting of this field, the attempt to import the data source fails with an error message.

NOTE: Regardless of the setting, if the size of the smallest file in the data source exceeds this limit, the attempt to import the data source fails. The dataset created by Xcalar Design must import at least one file in the data source. It cannot import a partial file.

The following list describes the options for this field:

  • Entire file/folder: If this option is selected, the maximum dataset size is 1 TB, by default. Due to the high limit, it is likely that Xcalar Design can import all the data in the entire data source, whether it is a file or a folder. This option is particularly useful for interactive analysis.

    If you are not the cluster administrator, consult your administrator to determine the current maximum dataset size. If you are the administrator, you can set the value of the MaxInteractiveDataSize parameter through the Setup icon of the Monitor. For information about setting parameter values, see Configuring parameters.

    EXAMPLE:  If your data source is a 900-GB file or a folder consisting of files that add up to 900 GB, Xcalar Design can import all data in the data source because the dataset size is below the 1-TB limit. If your data source is a 1.2 TB file, the attempt to import it fails. If your data source is a folder consisting of files that add up to 1.2 TB, Xcalar Design imports as many files in the folder as possible until the 1-TB limit is reached, provided that the smallest file in the folder is less than 1 TB.
  • Custom size: Typically, you use this option when you import a folder. When you set a dataset size limit, Xcalar Design creates a dataset from as many files as allowed by the limit. If you import a data source that is a single file, the file must be under the specified limit, or the attempt to import fails. This option is particularly useful for previewing raw data and modeling.

    The factory default for this field is 10 GB. You can set another default value by using the DATASET SIZE field in the General Settings window of the Monitor. The value for the DATASET SIZE field is automatically shown in the Custom size field each time you import data.

    If importing a data source requires a size greater than the size shown in the Custom size field, enter the desired value to replace the default value.

    EXAMPLE:  Suppose the dataset size is left at the default (10 GB). If your data source is a 5-GB file or folder, you can import all the data in the data source. If the data source is a 15-GB file, the attempt to import fails. If the data source is a 15-GB folder, Xcalar Design imports as many files as allowed by the 10-GB limit, provided that the smallest file is smaller than 10 GB.
    EXAMPLE: Suppose you use the Monitor to change the default dataset size to 20 GB. Each time you import a data source, the Custom size field automatically shows the default you set (20 GB). If your dataset needs a greater size, enter the desired value. For example, to be able to import a 30-GB file, you must enter 30 GB or a greater value.
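
The size rules in this section can be summarized in a short Python sketch. It is a simplified model, not Xcalar Design's algorithm: the function select_files is hypothetical, and the source does not say in which order files are considered or whether a file that does not fit is skipped, so the sketch assumes files are taken in the listed order and that a file that would push the total over the limit is skipped.

```python
def select_files(file_sizes_gb, limit_gb):
    """Return the sizes of the whole files that fit within limit_gb.

    Raises ValueError when even the smallest file exceeds the limit,
    because a partial file cannot be imported.
    """
    if not file_sizes_gb:
        raise ValueError("the data source contains no files")
    if min(file_sizes_gb) > limit_gb:
        raise ValueError("the smallest file exceeds the dataset size limit")
    selected = []
    total = 0
    for size in file_sizes_gb:
        if total + size <= limit_gb:  # assumed rule: skip files that do not fit
            selected.append(size)
            total += size
    return selected


# A 5-GB file under the 10-GB default is imported in full.
single_ok = select_files([5], 10)

# A 15-GB folder under the 10-GB default is imported only in part.
folder_partial = select_files([6, 5, 4], 10)

# A single 15-GB file cannot be imported under the 10-GB default.
try:
    select_files([15], 10)
    single_too_big = "imported"
except ValueError:
    single_too_big = "failed"

print(single_ok, folder_partial, single_too_big)
```

Raising the limit to 30 in the last call would let the 15-GB file through, which corresponds to entering a larger value in the Custom size field.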

Next step

To create a table from a dataset, see Creating a table.

To perform tasks associated with datasets, see Managing datasets.

NOTE: If you prefer to create a dataset and a table at the same time in the future, you can enable an option in the General Settings window. For information about changing a setting, see Changing the environment settings for Xcalar Design.