Creation date: 2017-08-21
Minimum version: 1.2.1

High Throughput Data Import

Objectives

Xcalar allows you to import very large amounts of data efficiently. This article explains how to design for a high throughput data import task:

  1. Preparing source data files
  2. Monitoring the system resource distribution

Problem

You want to import over a billion rows of foreign exchange (FOREX) data quickly and efficiently into Xcalar for analysis. How would you approach this problem with the FOREX data source described in Table 1 and an Xcalar cluster configured as described in Table 2?

Table 1 describes the FOREX data source.

Table 1: FOREX Dataset Description

Dataset Property     Value
File content         FOREX data
File format          CSV
Number of records    1.8 billion
Number of columns    Fixed, 4 columns
Size of dataset      88 GB

Table 2 describes the Xcalar cluster configuration. There are 32 cores per host, which amounts to a total of 128 cores available on the 4-node cluster.

Table 2: Xcalar Cluster Setup

Property             Value
Cluster size         4 hosts
Instance (if cloud)  c3.8xlarge
Memory               64 GB per host
Cores                32 per host
File system          SAN/NAS, HDD/SSD, HDFS, AWS S3
Network              10 Gb/s

The following two sections discuss how you can accomplish the two objectives described earlier.

Preparing the Data Source Files

This section explains how to prepare the data source files for high throughput data import.

Overview

With Xcalar, the number of files processed concurrently by a read function is proportional to the number of cores and the number of nodes. The following is a recommended guideline:

Code Block 1: Planning for Number of Files

numFiles >= numCoresPerNode * 3 * numNodes

where:

  • numFiles is the total number of files available to import on the cluster.
  • numCoresPerNode is the number of cores on each node of the cluster.
  • numNodes is the number of nodes in the cluster.

Based on this guideline and the cluster in Table 2, numFiles must be at least 32 * 3 * 4 = 384.
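As a quick sanity check, the guideline can be scripted. The variable values below are taken from the cluster in Table 2; substitute your own cluster's numbers:

```shell
#!/bin/sh
# Guideline: numFiles >= numCoresPerNode * 3 * numNodes
numCoresPerNode=32   # cores per host (Table 2)
numNodes=4           # hosts in the cluster (Table 2)

minFiles=$((numCoresPerNode * 3 * numNodes))
echo "minimum recommended number of files: $minFiles"   # 384
```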

Here are additional recommendations:

  • Each file should be roughly the same size, in order to spread the import workload evenly across cluster nodes and cores.
  • Each file must be at least 1 MB in size.

Removing Compression

Uncompress individual compressed files or extract file archives by using standard UNIX commands, such as tar xzf archive.tar.gz. Note that tar xvzf *.tar.gz does not extract every archive: the shell expands the wildcard, and tar treats only the first matching file as the archive. Extract each archive separately instead.
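A minimal sketch for extracting every .tar.gz archive in a directory, one at a time. The sample archive created at the top is for illustration only:

```shell
#!/bin/sh
# Create a small sample archive so the loop has something to extract.
mkdir -p sample && printf 'EUR/USD,1.18\n' > sample/rates.csv
tar czf sample.tar.gz sample
rm -rf sample                        # remove the original to show extraction

# Extract each gzip-compressed tar archive in the current directory.
for archive in *.tar.gz; do
    [ -e "$archive" ] || continue    # skip if the glob matched nothing
    tar xzf "$archive"
done
```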

Splitting the Files

Large Files

There is no single optimal file size that Xcalar recommends for loading. However, be aware that a single file is processed entirely by one core. For example, if you have 10 large files and 32 cores, only 10 cores are used and the other 22 cores remain idle. Split your large files into smaller chunks to avoid such scenarios. For example:

Code Block 2: Splitting a large file into smaller files that contain a maximum of 10,000 lines

split -l 10000 largefile.csv

The above command splits a large file on record boundaries, so that each output file has 10,000 records. Only the last file may have fewer than 10,000 records.

Small Files

If your file system block size is 4 KB, the read() function reads data in 4 KB chunks. One read call can only read from one file at a time, so smaller files require more I/O operations per second (IOPS). To reduce the number of IOPS, Xcalar recommends that your files be larger than 4 KB. For example:

Code Block 3: Combining multiple smaller files into one larger file

cat tinyfile1.csv tinyfile2.csv tinyfile3.csv > largerfile1.csv

You may need to write a shell script that combines multiple groups of smaller files into larger files.
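Such a script might look like the following sketch, which concatenates small CSV files into fixed-size batches. The batch size, file name patterns, and sample data are assumptions for illustration; the sketch also assumes the input files have no header rows:

```shell
#!/bin/sh
# Demonstration setup: create a few small sample files (assumed data).
printf 'a,1\n' > tinyfile1.csv
printf 'b,2\n' > tinyfile2.csv
printf 'c,3\n' > tinyfile3.csv

# Combine small CSV files into batches of up to batchSize files each.
batchSize=2
batch=0
count=0
for f in tinyfile*.csv; do
    if [ "$count" -eq 0 ]; then
        batch=$((batch + 1))
        : > "largerfile${batch}.csv"   # start a new output file
    fi
    cat "$f" >> "largerfile${batch}.csv"
    count=$(( (count + 1) % batchSize ))
done
```

With three input files and a batch size of 2, this produces largerfile1.csv (two records) and largerfile2.csv (one record). For real data, raise batchSize until each combined file is comfortably above the file system block size.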

Disproportionate Record Size

In certain cases, you may see high variance in record length across records. This may be due not only to a variable number of columns but also to variance in the size of the columns. Splitting by a fixed line count would then produce files of very different sizes; splitting by byte count on record boundaries keeps the files evenly sized. In such cases, you can use the following command:

Code Block 4: Splitting a large file into 128 smaller files across record boundaries

split -n l/128 largefile.csv

Splitting the Data Files

Let’s say you have a single CSV file (forex.csv) with 1,502,030,421 records and a record delimiter of \n. To meet the guideline of at least 384 roughly equal sized files, each file should contain:

1502030421 / 384 = 3911537 records (integer division)

The last split file will not have 3,911,537 records; it will instead contain only the remainder:

1502030421 % 3911537 = 213 records

Now you have enough information to run the split command:

Code Block 5: Splitting a large file into smaller files that contain 3,911,537 lines

split -l 3911537 forex.csv

The above command splits (on record boundary) the forex.csv file into individual files, each containing 3,911,537 records.
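You can verify the arithmetic with shell integer math before running the split:

```shell
#!/bin/sh
# Verify the split arithmetic for forex.csv from the example above.
totalRecords=1502030421
numFiles=384

recordsPerFile=$((totalRecords / numFiles))    # integer division
remainder=$((totalRecords % recordsPerFile))   # records in the last file

echo "records per full file: $recordsPerFile"  # 3911537
echo "records in last file:  $remainder"       # 213
```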

Disproportionate File Size

In some cases you may have to deal with disproportionate file sizes or a disproportionate number of records across files. Both of these can result in skew. For example:

Code Block 6: Listing file sizes in bytes

xcalar@node1:/datasets/forex$ ls -l

could show you an output that looks like:

Code Block 7: Output of list file size

-rw-rw-r-- 1 xcalar xcalar 72695083 Aug 29 17:56 xaa
-rw-rw-r-- 1 xcalar xcalar 72695041 Aug 29 17:56 xab
-rw-rw-r-- 1 xcalar xcalar 7267 Aug 29 17:56 xac
-rw-rw-r-- 1 xcalar xcalar 72694982 Aug 29 17:56 xad
-rw-rw-r-- 1 xcalar xcalar 72694993 Aug 29 17:56 xae
-rw-rw-r-- 1 xcalar xcalar 7269 Aug 29 17:57 xaf
-rw-rw-r-- 1 xcalar xcalar 7269 Aug 29 17:57 xag
-rw-rw-r-- 1 xcalar xcalar 7269 Aug 29 17:57 xah
-rw-rw-r-- 1 xcalar xcalar 72694966 Aug 29 17:57 xai
-rw-rw-r-- 1 xcalar xcalar 7269 Aug 29 17:57 xaj
-rw-rw-r-- 1 xcalar xcalar 72695102 Aug 29 17:57 xak
-rw-rw-r-- 1 xcalar xcalar 72694943 Aug 29 17:57 xal

In Code Block 7, you see that the file sizes are not the same: some files are roughly 72 MB while others are only about 7 KB. The cores assigned to the small files finish quickly and then sit relatively idle.
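One quick way to spot this kind of skew is to compare the smallest and largest file sizes in the dataset directory. The following sketch assumes a GNU userland (find -printf); the sample directory and files are created only for illustration:

```shell
#!/bin/sh
# Demonstration: flag size skew by comparing smallest and largest files.
dir=$(mktemp -d)
head -c 1024  /dev/zero > "$dir/part1"   # 1 KB file
head -c 65536 /dev/zero > "$dir/part2"   # 64 KB file

# Report min size, max size, and their ratio; a large ratio means skew.
result=$(find "$dir" -type f -printf '%s\n' | sort -n |
    awk 'NR == 1 { min = $1 } { max = $1 }
         END { printf "min=%d max=%d ratio=%d", min, max, max / min }')
echo "$result"
```

A ratio near 1 means the files are evenly sized; in this example the ratio is 64, which would be a clear sign of skew in a real dataset.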

Indexing

Xcalar automatically indexes the entire table, assigning a unique key to each row.

Cloud Storage

For cloud storage options, such as AWS S3, you will need to copy all your split files under a single key prefix, for example, datasets/forex in an AWS S3 bucket called xcfield. From Xcalar Design, in the Import Data Source field, choose s3:// as your Data Source Protocol, as shown in Figure 1, and browse to the path:

Code Block 8: S3 object key path

s3://xcfield/datasets/forex/

to view all the CSV part files.

Figure 1: Browse Data Source for AWS S3

Shared File Systems

For NAS and SAN, you will need to copy all your split files to a shared file system, for example:

Code Block 9: Shared file system directory for your datasets

/freenas/datasets/DataBlendingDemo/data/csv/tradeData

From Xcalar Design, in the Import Data Source field, choose file:// as your Data Source Protocol, as shown in Figure 2, and browse to the file path in Code Block 9 to view all the CSV part files.

Figure 2: Browse Data Source for SAN/NAS

In the case of HDFS, you will need to copy the files to an HDFS cluster and use hdfs:// as your Data Source Protocol.

Monitoring the System

In the previous section, we covered the guidelines for preparing your data source files. Minimizing skew results in higher throughput. Skew can be characterized as non-uniform usage of resources, such as CPU, memory, network, or I/O, across the cluster. You can use the Xcalar Design Monitor function to get a breakdown of basic metrics that helps you detect such skew.

Monitoring CPU

Distribution of CPU usage across hosts in the cluster should be uniform. If the Xcalar Design Monitor function shows CPU usage as low as 0.69% during an import, as shown in Figure 3, this is not a good sign and may indicate skew.

Figure 3: CPU as shown in Xcalar Monitor

You can then drill down and use htop as shown in Code Block 10 to investigate the individual cores.

Code Block 10: htop UNIX command for displaying CPU, Memory, and other meters

htop

In Figure 4, you can observe a single core spinning at 100% while all other cores are idle, which gives a clue to the root cause. This often occurs when only one source data file is being loaded per node.

Figure 4: htop showing CPU skew across cores

Monitoring Network

Network activity should be uniform across all the hosts on the cluster. The Xcalar Design Monitor function in Figure 5 shows the network usage statistics.

Figure 5: Network usage using Xcalar Monitor

You can also use iftop (Code Block 11) to get more detail, as shown in Code Block 12.

Code Block 11: UNIX command to fetch interface statistics

sudo iftop

could show you an output that looks like:

Code Block 12: Output of iftop

TX:     cum:  30.6kB   peak:  44.7kb   rates:  3.49kb  13.9kb  15.3kb
RX:           12.8kB          16.7kb           3.23kb  7.56kb  6.42kb
TOTAL:        43.4kB          52.3kb           6.73kb  21.4kb  21.7kb

Here, various totals are shown: the cumulative traffic transferred (after filtering), the peak traffic over the last 40 seconds, and the transfer rates averaged over 2, 10, and 40 seconds. All of this traffic is measured in kilobytes, which indicates very little network activity; that would be a warning sign if you were expecting to transfer files larger than 1 GB.

Monitoring Memory

In the case of shared file systems, such as NAS, SAN, HDFS, or S3, there is little possibility of memory skew.

Monitoring I/O

Larger file sizes increase I/O throughput. Many small files (typically sized in kilobytes) increase IOPS (I/O operations per second), which is not desirable. Increasing the number of large files increases CPU utilization up to the point where I/O throughput reaches its maximum.

Code Block 13: UNIX command to get Read and Write I/O information

sudo iotop

would show you an output that looks like:

Code Block 14: Output of iotop command

Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
3 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
7 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_sched]
8 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_bh]
9 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]

This output indicates a lack of any I/O on this node, which could indicate that no data is being read here.

For more information on the I/O statistics, run the following command:

Code Block 15: UNIX command to get IO statistics

iostat

which displays in the first column the transfers per second (tps), as shown in Code Block 16:

Code Block 16: Output of iostat command

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 5.53 50.19 88.44 288255266 507932413

The tps value shows the number of transfers issued to the device per second, also known as IOPS.

Summary

In this article, you have learned how to design a high throughput data import using Xcalar. You have learned guidelines for preparing source files to maximize Xcalar throughput and for detecting skew during the import process.
