Xcalar allows you to import very large amounts of data efficiently. This article explains how to design for a high throughput data import task:
- Preparing source data files
- Monitoring the system resource distribution
You want to import over a billion rows of foreign exchange (FOREX) data quickly and efficiently into Xcalar for analysis. How would you approach this problem with the FOREX data source described in Table 1 and an Xcalar cluster configuration as described in Table 2?
Table 1 describes the FOREX data source.
Table 1: FOREX Dataset Description
| File Content | FOREX Data |
| --- | --- |
| Number of records | 1.8 billion |
| Number of columns | Fixed, 4 columns |
| Size of dataset | 88 GB |
Table 2 describes the Xcalar cluster configuration. There are 32 cores per host, which amounts to a total of 128 cores available on the 4-node cluster.
Table 2: Xcalar Cluster Setup
| Cluster Size | 4 hosts |
| --- | --- |
| Instance (if cloud) | c3.8xlarge |
| Memory | 64 GB per host |
| Cores | 32 per host |
| File System | SAN/NAS, HDD/SSD, HDFS, AWS S3 |
The following two sections discuss how you can accomplish the two objectives described earlier.
Preparing the Data Source Files
This section describes how to prepare the data source files for high throughput data import.
With Xcalar, the number of files processed concurrently by a read function is proportional to the number of cores and the number of nodes. Following is a recommended guideline:
Code Block 1: Planning for Number of Files
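A minimal sketch of this guideline in shell arithmetic. The multiplier of 3 is an assumption inferred from the stated minimum of 384 files on this 4-node, 32-core-per-node cluster, not a documented constant:

```shell
# Guideline sketch: plan for several source files per core across the cluster.
# The multiplier of 3 is inferred from the 384-file figure (3 x 32 x 4);
# it is an assumption, not an officially documented constant.
numCoresPerNode=32
numNodes=4
numFiles=$((3 * numCoresPerNode * numNodes))
echo "Minimum recommended numFiles: $numFiles"
```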
- numFiles is the total number of files available to import on the cluster.
- numCoresPerNode is the number of cores on each node of the cluster.
- numNodes is the number of nodes in the cluster.
Based on the above guidelines, numFiles must be at least 384.
Here are additional recommendations:
- Each file should be roughly the same size, in order to spread the import workload evenly across cluster nodes and cores.
- Each file must be at least 1 MB in size.
Uncompress or extract individual compressed files or file archives by using standard UNIX commands, such as
tar xvzf archive.tar.gz (one archive at a time).
Splitting the Files
There is no single optimal file size that Xcalar recommends for loading. However, be aware that a single file is processed completely by one core. For example, if you have 10 large files and 32 cores, only 10 cores are used and the other 22 cores remain idle. Split your large files into smaller chunks to eliminate such scenarios. For example:
Code Block 2: Splitting a large file into smaller files that contain a maximum of 10,000 lines
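A representative command, assuming GNU coreutils split; the file name largefile.csv is a stand-in, generated here with seq so the example is self-contained:

```shell
# Generate a stand-in source file of 25,000 single-line records
seq 25000 > largefile.csv

# Split on line (record) boundaries, at most 10,000 lines per output file;
# produces largefile_part_aa, largefile_part_ab, largefile_part_ac
split -l 10000 largefile.csv largefile_part_

# largefile_part_aa and _ab have 10,000 lines each; _ac has the remaining 5,000
wc -l largefile_part_*
```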
The above command splits a large file into smaller files across record boundaries, where each file will have 10,000 records. The very last file will have less than or equal to 10,000 records.
If your file system block size is 4 KB, the read() function reads chunks of 4 KB. One read call can only read from one file at a time, so smaller files require more I/O operations per second (IOPS) to move the same amount of data. If you want to reduce the number of IOPS, Xcalar recommends that your files be larger than 4 KB. For example:
Code Block 3: Combining multiple smaller files into one larger file
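A sketch of the idea; the part-file names are hypothetical and generated here so the example runs end to end:

```shell
# Create three small stand-in part files
seq 1 100   > small_part_aa.csv
seq 101 200 > small_part_ab.csv
seq 201 300 > small_part_ac.csv

# Concatenate them into one larger file; argument order sets record order
cat small_part_aa.csv small_part_ab.csv small_part_ac.csv > larger_file.csv

wc -l larger_file.csv   # 300 records
```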
You may need to write a shell script that combines multiple groups of smaller files into larger files.
Disproportionate Record Size
In certain cases, you may see a high variance in record length across records. This may be due not only to a variable number of columns, but also to variance in the size of individual column values. In such cases, you can use the following command:
Code Block 4: Splitting a large file into 128 smaller files across record boundaries
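One way to do this, assuming GNU split; the -n l/N chunk syntax divides by byte size rather than line count, while keeping every record intact:

```shell
# Stand-in source file
seq 100000 > largefile.csv

# Split into exactly 128 chunks of roughly equal byte size,
# without splitting any line (record) across two chunks
split -n l/128 largefile.csv largefile_part_

ls largefile_part_* | wc -l   # 128
```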
Splitting the Data Files
Let’s say you have a single CSV file (forex.csv) with 1,502,030,421 records and a record delimiter of \n. If you need 384 roughly equal-sized files, each file should have:
1502030421 / 384 = 3911537 records (rounded down).
The last split file will not have 3,911,537 records; instead, it will have the remainder:
1502030421 % 3911537 = 213 records.
Now you have enough information to run the split command:
Code Block 5: Splitting a large file into smaller files that contain 3,911,537 lines
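A command matching this arithmetic, assuming GNU split. Because a 1.5-billion-record file is impractical here, the example generates a small stand-in, but the -l value is the one computed above:

```shell
# Stand-in for the real forex.csv (1,502,030,421 records in practice)
seq 1000 > forex.csv

# At most 3,911,537 records per output file, split on record boundaries;
# -d -a 3 gives numeric suffixes: forex_part_000, forex_part_001, ...
split -d -a 3 -l 3911537 forex.csv forex_part_
```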
The above command splits (on record boundaries) the forex.csv file into individual files, each containing 3,911,537 records, except the last file, which contains the remaining 213 records.
Disproportionate File Size
In some cases you may have to deal with disproportionate file sizes or a disproportionate number of records across files. Both of these can result in skew. For example:
Code Block 6: Listing file sizes in bytes
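For example, using ls -l. The directory and part-file names are illustrative, and the files are generated here with deliberately unequal sizes to make the skew visible:

```shell
# Create part files of deliberately unequal size
mkdir -p forex_parts
head -c 52428800  /dev/zero > forex_parts/forex_part_000   # 50 MB
head -c 1048576   /dev/zero > forex_parts/forex_part_001   # 1 MB
head -c 104857600 /dev/zero > forex_parts/forex_part_002   # 100 MB

# Column 5 of the long listing is the file size in bytes
ls -l forex_parts/
```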
could show you an output that looks like:
Code Block 7: Output of list file size
In Code Block 7, you see that the file sizes are not the same. This will keep some cores relatively idle.
Xcalar auto-indexes the entire table, which gives a unique key to each row in the table.
For cloud storage options, such as AWS S3, you will need to copy all your split files under a single key prefix, for example,
datasets/forex in an AWS S3 bucket called xcfield. From Xcalar Design, in the Import Data Source field, choose s3:// for your Data Source Protocol, as shown in Figure 1, and browse to the path:
Code Block 8: S3 object key path
to view all the CSV part files.
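Combining the example bucket and prefix above, the Code Block 8 path would look like:

```
s3://xcfield/datasets/forex/
```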
Figure 1: Browse Data Source for AWS S3
Shared File Systems
For NAS and SAN, you will need to copy all your split files to a shared file system, for example:
Code Block 9: Shared file system directory for your datasets
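For example, a hypothetical shared mount point; the actual path depends on your environment:

```
/netstore/datasets/forex/
```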
From Xcalar Design, in the Import Data Source field, choose file:// for your Data Source Protocol, as shown in Figure 2, and browse to the Code Block 9 file path, to view all the CSV part files.
Figure 2: Browse Data Source for SAN/NAS
In the case of HDFS, you will need to copy the files to an HDFS cluster and use hdfs:// as your Data Source Protocol.
Monitoring the System
In the previous section, we went over the guidelines for preparing your data source files. Minimizing skew can result in higher throughput. Skew can be characterized in terms of non-uniform usage of resources, such as CPU, Memory, Network, or I/O across the cluster. You can use the Xcalar Design Monitor function to get a breakdown on basic metrics that will help you detect any such skew.
Distribution of CPU usage across hosts in the cluster should be uniform. If the Xcalar Design Monitor function shows you a CPU usage of 0.69%, as shown in Figure 3, this is not a good sign and may indicate potential skew.
Figure 3: CPU as shown in Xcalar Monitor
You can then drill down and use htop as shown in Code Block 10 to investigate the individual cores.
Code Block 10: htop UNIX command for displaying CPU, Memory, and other meters
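htop is an interactive viewer, so it takes no arguments for this use; per-core CPU meters appear at the top of its display:

```shell
# Interactive process viewer; shows one CPU meter per core
htop
```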
In Figure 4, you observe a single core spinning at 100% and all other cores are idle, which may give a clue on the root cause. Often this occurs when you are only loading one source data file per node.
Figure 4: htop showing CPU skew across cores
Network activity should be uniform across all the hosts on the cluster. The Xcalar Design Monitor function in Figure 5 shows the network usage statistics.
Figure 5: Network usage using Xcalar Monitor
You can also use iftop to get more details, as shown in Code Block 12.
Code Block 11: UNIX command to fetch interface statistics
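A typical invocation; the interface name eth0 is an assumption, and iftop usually requires root privileges:

```shell
# Show per-connection bandwidth on a given interface (name is an assumption)
sudo iftop -i eth0
```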
could show you an output that looks like:
Code Block 12: Output of iftop
Here, various totals are shown, including peak traffic over the last 40 seconds, the total traffic transferred (after filtering), and the total transfer rates averaged over 2 seconds, 10 seconds, and 40 seconds. Most of this traffic is in kilobytes, which indicates very little network traffic; that would be suspiciously low if you were transferring files larger than 1 GB.
With shared file systems, such as NAS/SAN/HDFS/S3, there is little possibility of memory skew.
Larger file sizes increase I/O throughput. Many small files (typically sized in KB) increase IOPS (I/O operations per second), which is not desirable. Increasing the number of larger files increases CPU utilization, up to the point where I/O throughput reaches its maximum.
Code Block 13: UNIX command to get Read and Write I/O information
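A typical invocation of iotop, which requires root privileges; -o restricts the display to processes actually performing I/O:

```shell
# Display current read/write I/O per process
sudo iotop -o
```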
would show you an output that looks like:
Code Block 14: Output of iotop command
This output indicates a lack of any I/O on this node, which could indicate that no data is being read here.
For more information on the I/O statistics, run the following command:
Code Block 15: UNIX command to get IO statistics
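A typical invocation, assuming the sysstat package is installed; -d restricts the report to device statistics and 2 sets a two-second refresh interval:

```shell
# Per-device transfer statistics, refreshed every 2 seconds (Ctrl-C to stop)
iostat -d 2
```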
which displays the transfers per second (tps) for each device, as shown in Code Block 16:
Code Block 16: Output of iostat command
This shows you the number of actual transfers to disk, also known as IOPS.
In this article, you have learned how to design for a high throughput data import using Xcalar Import. You have learned guidelines for maximizing Xcalar's import throughput and for detecting any skew that arises during the import process.