January 25, 2018

Test Driving Xcalar Design for Ad Hoc Analytics

This is the first article in a new blog series that will demonstrate the power of the Xcalar Data Platform for ad hoc analytics. Over the course of the series, I will show how quickly insights can be obtained from publicly available data containing hundreds of millions of rows.

For this exercise I will showcase Xcalar Design’s visual analysis capability, using publicly available San Francisco employee salary data from https://www.kaggle.com/kaggle/sf-salaries, containing just under 150,000 records.

Not knowing exactly what was in the data, I decided on a few exploratory questions that would display Xcalar Design’s rich set of operational functionality.  Here are the questions:
  1. During which year were San Francisco city employees paid the highest and the lowest salaries?
  2. How many police officers are there in the dataset and what is the mean and standard deviation of their pay?
  3. What does the distribution of salaries of full-time employees look like?
  4. Is there any correlation between an employee’s base pay and overtime pay?
And here are some of the tasks that will help find the answers:
  • Importing datasets into Xcalar
  • Data casting
  • Using the Group by operator
  • Working with correlation, filtering, and profile functions
Before I could analyze the data to answer any of my questions, I first needed to import the data into Xcalar Design, which creates a new dataset table from the data source. Since this dataset is stored in a comma-separated values (CSV) file, one of the many formats that Xcalar supports, no additional scripts were necessary to import the file into the dataset table. Xcalar Design was able to parse this file as is.

I really like that I can choose which columns of data I want to analyze before the dataset table is created, and that if I later need a column I had deemed unimportant, I can go back and add it to the table without starting from scratch. Once the dataset table was to my liking, I did a bit of cleanup by setting the correct data type for each column with Xcalar Design’s SMART CAST panel feature.
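Xcalar Design handles all of this visually, but for readers who prefer to think in code, here is a minimal pandas sketch of the same preparation steps, assuming the Kaggle CSV file name and its standard column names; it is only an illustrative equivalent, not Xcalar’s API.

```python
import pandas as pd

# Load the Kaggle SF Salaries CSV (file name assumed), keeping only the
# columns this analysis needs -- roughly what choosing columns in Xcalar
# Design does before the dataset table is created.
cols = ["JobTitle", "BasePay", "OvertimePay", "TotalPay", "Year", "Status"]
salaries = pd.read_csv("Salaries.csv", usecols=cols)

# The pay columns can arrive as strings (the file contains values such as
# "Not Provided"), so cast them to numbers -- a rough stand-in for the
# SMART CAST cleanup step.
for col in ["BasePay", "OvertimePay", "TotalPay"]:
    salaries[col] = pd.to_numeric(salaries[col], errors="coerce")
```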

Question 1: Determining the highest and the lowest paying salary years

I was now ready to manipulate the data and answer my first question: in which years were San Francisco employees paid the highest and lowest average salaries?

Step 1.1: Group the rows by year. This is a good place to use a Group By operation. To get the average pay by year, I applied the avg function to the TotalPay column of the Salaries table from the GROUP BY panel.
Figure 1: GROUP BY panel
The Group By operation created a new table, called Average_Salary_By_Year, with two columns called Year_avg and Year, as shown in Figure 2.
Figure 2: Resultant table containing new Year_avg and Year columns
Step 1.2: Sorting the rows by the average salary will tell me which year the employees had the highest average income.
In the Average_Salary_By_Year table from the previous step, I ran a descending order Sort operation on the Year_avg column, as shown in Figure 3.
Figure 3: Sorting a column in descending order
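For comparison, the same two steps expressed as a pandas sketch are shown below. The Year_avg column name mirrors the one Xcalar Design generated; the code is only an illustrative equivalent of the visual operations, not Xcalar’s API.

```python
# Step 1.1 equivalent: average TotalPay per Year.
average_salary_by_year = (
    salaries.groupby("Year", as_index=False)["TotalPay"]
    .mean()
    .rename(columns={"TotalPay": "Year_avg"})
)

# Step 1.2 equivalent: sort descending so the highest-paying year is first.
average_salary_by_year = average_salary_by_year.sort_values(
    "Year_avg", ascending=False
)
print(average_salary_by_year)
```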

Answer

From the results, which are shown in Figure 4, you can see that on average, employees were paid the highest salary in 2013 and the lowest salary in 2011.
Figure 4: Highest and lowest salary years

Question 2: Determining the mean and standard deviation of Police Officer salaries

For this task I wanted to find more specifics on Police Officer salaries.

Step 2.1: Filter the rows based on job titles

For this step, I only wanted to work with the rows whose job titles contain the string Police. Using a filter operation in the FILTER panel, I reduced the rows to those containing the case-insensitive string Police, as shown in Figure 5.
Figure 5: FILTER panel
Step 2.2: Statistical analysis
With the data filtered, I could now analyze the remaining records. For this step, I used the Profile operation to visualize the police officer data. From the Profile modal window, under the Aggregate Summary section, I was able to view the desired information, as shown in Figure 6.
Figure 6: Aggregate Summary section
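The equivalent filter and summary in pandas would look roughly like the sketch below; I am assuming the profile was run on the TotalPay column, and the standard deviation may differ slightly from Xcalar’s summary depending on whether a sample or population definition is used.

```python
# Step 2.1 equivalent: keep rows whose job title contains "Police",
# ignoring case.
police = salaries[salaries["JobTitle"].str.contains("Police", case=False, na=False)]

# Step 2.2 equivalent: the aggregate summary for those rows.
print(len(police))                # number of police rows
print(police["TotalPay"].mean())  # average pay
print(police["TotalPay"].std())   # standard deviation of pay (sample std)
```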

Answer

From Figure 6, you can see that there are 8,201 police officers in the dataset, their average salary is $121,009.165, and the standard deviation is $42,254.936.

Question 3: Determining the distribution of salaries of full-time employees

For this question, I wanted to know what the distribution of full-time employees’ salaries looks like by job title.

Step 3.1: Filtering out part-time employees
For this step, I again used the FILTER panel on the Salaries table to filter the Status column, which contains the shift classifications, by values that are equal to FT (full-time).
Figure 7: Filtering on the full-time value
Step 3.2: Grouping employees by their Job Titles
Since I wanted to see the distribution based on the average salary for each job title, I ran another Group By operation. I applied the avg function on the TotalPay column to find the average pay for a particular job title.
Figure 8: Using the Group By operation for finding the average pay
Step 3.3: Graphically modeling the distribution
To visually display the distribution, I ran a Profile operation on the Ave_Pay column in the Avg_Pay_By_Job_Title table, which opened the Profile modal window. In the Profile modal window, from the Range bucketing list, I selected Fit all, which displayed the results as a bar chart, as shown in Figure 9.
Figure 9: Profile modal window Fit all result
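A rough pandas equivalent of steps 3.1 through 3.3 is sketched below; the histogram stands in for the Profile modal’s Fit all bar chart and requires matplotlib to render.

```python
# Step 3.1 equivalent: keep only full-time employees.
full_time = salaries[salaries["Status"] == "FT"]

# Step 3.2 equivalent: average TotalPay per job title.
avg_pay_by_job_title = (
    full_time.groupby("JobTitle")["TotalPay"].mean().rename("Avg_Pay")
)

# Step 3.3 equivalent: inspect the distribution numerically (count, mean,
# std, quartiles) and visually as a histogram.
print(avg_pay_by_job_title.describe())
avg_pay_by_job_title.plot.hist(bins=30)
```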

Answer

When analyzing the results, I noticed that the distribution of pay by job title appears to be right-skewed (the mean is greater than the median) and that the majority of the employee wages are displayed further to the left in the bar chart. In the Quartile Summary section, you can see that the median income is $95,824.432 for full-time employees in the dataset.

Question 4: Is there any correlation between the base pay of an employee and their overtime pay?

For this task I wanted to see if there is a correlation between base pay and overtime pay. In the Salaries table, I selected both the BasePay and OvertimePay columns, right-clicked on the columns, and then selected Correlation.
Figure 10: Running the Correlation
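For readers following along in code, the same statistic is a Pearson correlation coefficient, which a pandas sketch can compute directly (again assuming the column names from the Kaggle CSV):

```python
# Pearson correlation between base pay and overtime pay.
r = salaries["BasePay"].corr(salaries["OvertimePay"])
print(round(r, 3))
```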

Answer

The result was a correlation coefficient (r) of 0.264, showing a weak, positive linear relationship between the base pay of an employee and their overtime pay, as shown in Figure 11.
Figure 11: Correlation and Quick Aggregates modal window

Conclusion

  • With Xcalar I was able to explore huge amounts of data interactively. This is due to Xcalar’s scalability, which supports interactive analysis of very large datasets, up to a trillion rows, using relational operations including join and group by.
  • I was able to discover answers to my questions with very few operations in Xcalar Design, which is a web-based visual modeling tool that enables interactive and intuitive design of algorithms to derive meaningful insights from data.
This concludes my first Xcalar blog entry. Stay tuned for more complex examples to come.
For more detailed instructions on how to perform the Xcalar Design tasks described in this blog, see the Xcalar Design Help documentation.
Dan Kapper