Xcalar is well suited to running large-scale data transformations and enrichment, seamlessly combining and analyzing data. Loading data into the Xcalar platform opens the door to working easily with large datasets and efficiently extracting knowledge and insights. In this exercise, we work with several years of NYC parking ticket records to understand which car brands produce the most ticket revenue for the city, per borough. Here we go through all the steps necessary to achieve this.

Part 3: Visualization and Integration with Jupyter

3.1 While Xcalar has some visualization capabilities, many users may prefer to do their visualization in other BI tools such as Tableau, Qlik, or PowerBI. We will cover those integrations in an upcoming blog post. Here I will describe how I used a Jupyter Notebook to visualize data trends with Python visualization libraries. A few steps are needed to connect to our data and read it into our own Python environment. To make the connection, we need an ODBC driver, which Xcalar supports. There are various options, including Simba, Microsoft, and Databricks; we decided to download and install the Simba Spark ODBC Driver.

3.2 Next, the pyodbc module must be installed in your Python environment. This module provides simple access to ODBC databases from Python.
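As a quick sanity check after installing, the snippet below (a hypothetical helper, not part of the original notebook) confirms that pyodbc is importable and prints its version:

```python
import importlib.util

def pyodbc_status():
    """Return a short message describing whether pyodbc is available."""
    if importlib.util.find_spec("pyodbc") is None:
        return "missing: run `pip install pyodbc`"
    import pyodbc
    return "installed: pyodbc %s" % pyodbc.version

print(pyodbc_status())
```

If the message says "missing", install the module with `pip install pyodbc` and re-run the check.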

3.3 Before implementing the function that connects to our data, make sure the Simba driver is in the right location for the OS you are using. Here are the driver locations for the three operating systems, based on our experience:

Windows: DRIVER=Simba Spark ODBC Driver

MacOS: DRIVER=/Library/simba/spark/lib/libsparkodbc_sbu.dylib

Linux (WSL2): DRIVER=/opt/simba/spark/lib/64/libsparkodbc_sb64.so
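The three locations above can be selected automatically at runtime. This is a small sketch (the helper name and the dictionary are my additions); note that on Windows the DRIVER value is a registered driver name rather than a file path:

```python
import platform

# DRIVER values from the list above; adjust if you installed the driver elsewhere.
DRIVER_PATHS = {
    "Windows": "Simba Spark ODBC Driver",                          # registered name
    "Darwin": "/Library/simba/spark/lib/libsparkodbc_sbu.dylib",   # macOS
    "Linux": "/opt/simba/spark/lib/64/libsparkodbc_sb64.so",       # incl. WSL2
}

def get_driver(os_name=None):
    """Return the DRIVER value for the given (or current) operating system."""
    os_name = os_name or platform.system()
    if os_name not in DRIVER_PATHS:
        raise ValueError("no Simba driver location configured for %r" % os_name)
    return DRIVER_PATHS[os_name]
```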

3.4 We are now ready to define our get_xcalar_connection function. This function returns a connection to your Xcalar instance so that you can access the data there.

import pyodbc

def get_xcalar_connection(uid, psw, url, port=10000):
    cstr = (
        "DRIVER=Simba Spark ODBC Driver;"  # on some systems: {Simba Spark ODBC Driver 64-bit}
        "AuthMech=3;"
        "ThriftTransport=1;"
        "SparkServerType=3;"
        "LogLevel=6;"
        "SSL=0;"
        "UID=%s;"
        "PWD=%s;"
        "HOST=%s;"
        "PORT=%d;"
    ) % (uid, psw, url, port)

    conn = pyodbc.connect(cstr, autocommit=True)
    # Echo the connection string for debugging, with the password masked.
    print(cstr.replace("PWD=%s;" % psw, "PWD=***;"))
    # ODBC defaults to UTF-16; Python 3 strings are UTF-8.
    conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
    conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
    conn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-8')
    conn.setencoding(encoding='utf-8')
    return conn

The code above sets the DRIVER argument to the default location on Windows. Make sure to change it based on your OS and where the driver is installed.

3.5 Now, using this function we will initialize a connection object:

conn_qa = get_xcalar_connection(USERNAME,PASSWORD,'ca9u.xcalar.cloud')

Make sure to input the username and password you use for your Xcalar cloud login. The third argument is the cluster URL, which may differ for a given user - make sure to include the right cluster URL for your Xcalar instance (for VPC it is predefined). For security purposes, it is of course good practice not to input your password directly but rather read it securely from elsewhere, especially if going to production.
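One hedged pattern for keeping the password out of the notebook: read credentials from environment variables, falling back to an interactive prompt. The variable names XCALAR_USER and XCALAR_PASSWORD are hypothetical - use whatever your secrets setup provides:

```python
import os
from getpass import getpass

def get_credentials():
    """Read Xcalar credentials from the environment, prompting as a fallback."""
    uid = os.environ.get("XCALAR_USER") or input("Xcalar username: ")
    psw = os.environ.get("XCALAR_PASSWORD") or getpass("Xcalar password: ")
    return uid, psw
```

You could then call `get_xcalar_connection(*get_credentials(), 'ca9u.xcalar.cloud')` instead of hard-coding the password.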

If all is well, running that line will print the configuration of the connection object you defined. Make sure that your Xcalar instance is running and has not timed out.

3.6 With that, we can now connect and obtain our data from the Xcalar platform. We can use pandas' read_sql_query function to send SQL statements:

import pandas as pd

import matplotlib.pyplot as plt

pd.read_sql_query("SHOW TABLES",conn_qa)

This should return a pandas DataFrame listing the tables currently active in your Xcalar session, confirming that everything reads properly.

3.7 Now we can easily read the MY_RESULT table into a pandas DataFrame and perform a simple visualization. Recall that our aim was to present the top ticket-producing car brands (in dollars) for each NYC borough. The filtering and grouping were executed quite efficiently in Xcalar, which applies these operations in parallel, so we distilled the repository of several years of tickets down to a much smaller, refined dataset in a short execution time. For simplicity's sake, we can limit it to the top 10.
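To make that aggregation concrete: the Xcalar-side filter-and-group step is logically equivalent to the pandas group-by below, shown on a few made-up rows (the real work ran in parallel inside Xcalar; only the column names follow the MY_RESULT schema):

```python
import pandas as pd

# Made-up ticket rows standing in for the raw multi-year dataset.
tickets = pd.DataFrame({
    "VEHICLE_MAKE":     ["TOYOT", "TOYOT", "CHEVR", "FORD"],
    "VIOLATION_COUNTY": ["MN",    "MN",    "MN",    "BX"],
    "VIOLATION_AMOUNT": [115.0,   65.0,    50.0,    115.0],
})

# Per-borough, per-make ticket count and total dollars - the MY_RESULT shape.
result = (
    tickets.groupby(["VEHICLE_MAKE", "VIOLATION_COUNTY"], as_index=False)
    .agg(NUM_TICKETS=("VIOLATION_AMOUNT", "size"),
         VIOLATION_AMOUNT=("VIOLATION_AMOUNT", "sum"))
)
```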

This code will give us the Pandas dataframe needed:

df = pd.read_sql_query("SELECT * FROM MY_RESULT",conn_qa)

df.columns = ["VEHICLE_MAKE","VIOLATION_COUNTY","NUM_TICKETS","VIOLATION_AMOUNT"]

df = df.sort_values('VIOLATION_AMOUNT',ascending=False).reset_index(drop=True)

Then we can graph:

df[df.VIOLATION_COUNTY.isin(['BX'])][:10].plot.bar(x='VEHICLE_MAKE',y='VIOLATION_AMOUNT',rot=45,title="Top 10 Ticket Producing Makes in Bronx")

df[df.VIOLATION_COUNTY.isin(['MN'])][:10].plot.bar(x='VEHICLE_MAKE',y='VIOLATION_AMOUNT',rot=45,title="Top 10 Ticket Producing Makes in Manhattan")
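The two plot calls above can be wrapped in a small helper; this is a sketch, and the dollar axis label and optional output file are my additions, not part of the original notebook:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; inside Jupyter the default backend is fine
import matplotlib.pyplot as plt
import pandas as pd

def plot_top_makes(df, county, title, n=10, outfile=None):
    """Bar-chart the top-n ticket-producing makes for one borough code."""
    ax = (df[df.VIOLATION_COUNTY == county][:n]
          .plot.bar(x="VEHICLE_MAKE", y="VIOLATION_AMOUNT", rot=45, title=title))
    ax.set_ylabel("Ticket revenue ($)")
    plt.tight_layout()
    if outfile:
        plt.savefig(outfile)
    return ax
```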

In the interest of length, I plot only Manhattan and the Bronx:

Toyota tops the ranking in both boroughs, and in general a similar trend is seen, though brands like Volvo and Lexus make the top 10 in Manhattan but not in the Bronx, which is an expected trend. One interesting difference is the disparity between Toyota and the second-place car brand (Chevrolet) in Manhattan; no such large disparity is seen in the Bronx.
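That disparity can be quantified as the ratio between the top two brands' ticket dollars. A small sketch on made-up figures (the real values would come from the df above; the helper name is my own):

```python
import pandas as pd

def top_two_gap(df, county):
    """Ratio of the #1 make's ticket dollars to the #2 make's in one borough."""
    top = (df[df.VIOLATION_COUNTY == county]
           .nlargest(2, "VIOLATION_AMOUNT")["VIOLATION_AMOUNT"].tolist())
    return top[0] / top[1]

# Made-up figures purely to illustrate the calculation.
sample = pd.DataFrame({
    "VEHICLE_MAKE":     ["TOYOT", "CHEVR", "FORD"],
    "VIOLATION_COUNTY": ["MN",    "MN",    "MN"],
    "VIOLATION_AMOUNT": [300.0,   100.0,   80.0],
})
```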

3.8 It’s important to convey how we have brought these two technologies together to enhance our engineering workflow, and how they differ within that coordination. We use Xcalar's powerful compute engine to perform complex transformations over large datasets where a single-threaded Jupyter process would not scale, and keep the final, intricate visualization logic in our Python notebook. So while the IPython notebook may perform single-threaded, row-by-row operations, in Xcalar every core of the cluster runs our operations in parallel - a difference in scale and capability worth emphasizing. In Xcalar we can apply a wide variety of business logic, as we have seen, computed very efficiently by taking advantage of parallelization. Tools such as matplotlib in our Python notebook represent the "presentation stage", while the heart of the computation that brings our data to this final form is the Xcalar compute engine.