Predictive Analytics with Machine Logs at a Leading Hardware Company
A leading computer hardware company has systems deployed around the world. To better understand how its products are being used, the company collects telemetry from most of these deployed systems. Data bundles arrive daily from several hundred thousand systems, resulting in petabytes of raw data to store, manage, and analyze. Xcalar partnered with the company's Data Analytics team to improve the efficiency and completeness of the data discovery workflow. Xcalar has created business value in several areas:
- Measuring the use and effectiveness of system performance and efficiency features, and recommending how to increase the value of those features to customers
- Understanding customer system configuration choices, and how they can affect performance and reliability
- Modeling trends and applying machine learning to predict customer business opportunities and risks
- Improving the efficiency of developing and delivering new systems and software upgrades
- Identifying systems which are at risk of failure or unavailability due to misconfigurations or environmental issues
The size and complexity of the incoming data bundles require a complex tool chain to process and analyze. A typical bundle may consist of more than 100 sections, each represented by a single file. These files come in a wide variety of formats, including XML, Excel, binary, logs, text-based tables, and free-form text, some using mixed character sets. File sizes range from a few kilobytes to over one gigabyte. Field counts per file type vary from a handful to over 500. Formats vary widely across system types, system software versions, and product models. It has been a challenge to fully leverage this data to gain insights, determine root causes of system problems, and predict future problems, for the following reasons:
- Deriving insights requires multiple steps with multiple tools, often involving multiple people in different groups and geographies.
- Raw data bundles are pre-processed to intermediate on-disk tables using a complex set of parsers thousands of lines long. Few people understand this transformation or how to maintain it. Changes and enhancements to the raw data file formats occur frequently. It is difficult for the parser maintainers to keep up with changes and verify that their parser updates are correct.
- There is no mechanism to ensure that the processing of raw data files into intermediate persistent tables is accurate, complete, and contains no duplicates. There is no visible lineage back to the original data.
- In order to gain insights from these complex source files, users must combine information from multiple sections of the data bundles and analyze it relationally. Traditional analysis tools require programming to discover correlations, analyze trends, or predict problems, failures, or outages. Without a way to perform these actions within a visual framework, the iteration and debug cycles become very lengthy and inefficient.
Xcalar Data Platform provides an end-to-end analysis capability. It orchestrates the analysis from raw data file import to result tables, which can be exported to visualization and reporting tools at the customer site. Users explore their data and develop complex algorithms visually, using a spreadsheet-style interface to access petabytes of data.
The solution leverages many of the features provided by the Xcalar Design visual interface, including shared datasets and UDFs (User-Defined Functions), custom Python parsers, profiling and statistical analysis, windowing, data lineage, and auditability. Xcalar Compute Engine provides the scale and performance to model dataflows on multiple datasets with hundreds of millions of rows. In addition, Xcalar Compute Engine can operationalize these dataflows on larger datasets, or for specific date stamps on the source data. The combination of these capabilities results in dramatic productivity improvements and insights at a level that was previously not possible with the company's existing analytics toolset.
Data Bundle Section Datasets
Users create datasets by importing a set of files on an NFS server that represent a specific section in each of the data bundles. Users start with one dataset, then create additional datasets containing the information they find they need during modeling, using a unique data bundle identifier as the join key.
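As a rough illustration of the join described above, the following Python sketch combines records from two bundle sections on a shared bundle identifier. It is plain Python rather than Xcalar's own join operator, and the field names (`bundle_id`, `model`, `sw_version`, `disk`, `bytes_read`) and sample values are hypothetical.

```python
from collections import defaultdict


def join_sections(left_rows, right_rows, key="bundle_id"):
    """Inner-join two lists of dict records on a shared bundle identifier."""
    # Index the right-hand section by the join key for constant-time lookup.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            merged = dict(right)
            merged.update(left)  # on a name clash, the left-hand field wins
            joined.append(merged)
    return joined


# Hypothetical records from two sections of the same bundles.
system_info = [
    {"bundle_id": "b-001", "model": "M100", "sw_version": "9.1"},
    {"bundle_id": "b-002", "model": "M200", "sw_version": "9.3"},
]
disk_stats = [
    {"bundle_id": "b-001", "disk": "d0", "bytes_read": 1024},
    {"bundle_id": "b-001", "disk": "d1", "bytes_read": 4096},
    {"bundle_id": "b-003", "disk": "d0", "bytes_read": 512},
]

rows = join_sections(system_info, disk_stats)
```

Only bundle `b-001` appears in both sections, so the result holds one row per disk of that bundle, each carrying the system's model and software version.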
Datasets are shared among all users in Xcalar Design. Once one user has created a dataset for a specific section and date range, others can create a table from it and use it in their dataflows as well. The original data files remain unchanged throughout; they are simply referenced as needed.
The data formats of bundle sections differ widely. Xcalar directly imports those that are in an open format such as XML, Excel, CSV, JSON, Parquet, or logs. Users develop custom import UDFs to parse custom file formats, such as tables-as-text or key-value pairs interspersed with unrelated text. Typical Python import UDFs are only 40-50 lines long. They can be modified or updated directly in Xcalar Design as bundle section formats evolve over time. As new sections are added, new import UDFs are developed, often heavily leveraging an existing import UDF for a different section.
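To give a sense of what a small parser for "key-value pairs interspersed with unrelated text" might look like, here is a minimal Python sketch. The `(full_path, in_stream)` signature, the generator-of-dicts output, the accepted `key = value` and `key: value` syntax, and the sample file content are all assumptions made for illustration; the actual import UDF interface should be taken from the Xcalar documentation.

```python
import io
import re

# Matches lines such as "serial_number = ABC123" or "uptime_days: 42".
KEY_VALUE = re.compile(r"^\s*([A-Za-z_][\w.-]*)\s*[:=]\s*(.+?)\s*$")


def parse_key_values(full_path, in_stream):
    """Yield one record per key-value line, skipping unrelated text.

    The (full_path, in_stream) signature is an assumption about how an
    import UDF receives a file; adapt it to the real interface.
    """
    for line_number, raw in enumerate(in_stream, start=1):
        # Accept both binary and text streams; replace undecodable bytes.
        if isinstance(raw, bytes):
            line = raw.decode("utf-8", errors="replace")
        else:
            line = raw
        match = KEY_VALUE.match(line)
        if match is None:
            continue  # free-form text between the pairs is ignored
        key, value = match.groups()
        yield {"source": full_path, "line": line_number,
               "key": key, "value": value}


# Hypothetical section file content.
sample = io.StringIO(
    "=== system summary ===\n"
    "serial_number = ABC123\n"
    "Collected by the nightly job.\n"
    "uptime_days: 42\n"
)
records = list(parse_key_values("/bundles/b-001/summary.txt", sample))
```

Keeping the source path and line number on every record is one simple way to retain a trail back to the raw file while the parser is still being refined.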
A common approach to writing bundle section parsers is to start with something simple, and then use Xcalar Design to better understand the structure of the data visually. For example, a very simple UDF, or even the built-in text format parser, can parse a log file into one record per log entry. Further parsing using split and other map operators can be done visually, verifying successful parsing as you go. Alternatively, users with coding experience might prefer to take insights gained from this initial exploration and integrate them into the parsers.
Profiling, Exploration, and Validation
The first step in modeling is to create a table from one of the section datasets, and perform initial profiling to understand and validate the data. For example, a user might use profiling and map operations on various columns in a table containing basic system information to visually confirm that all records contain valid software version numbers, or that the version matches what is expected for each model type. Because Xcalar Design uses a powerful scale-out compute engine for modeling, users can play with tables of hundreds of millions of rows, and quickly validate, explore, and transform weeks’ or months’ worth of bundle data interactively. Some common data errors that users have discovered include:
- duplicate, incomplete, or missing records
- misspelled or inconsistent labels
- inconsistencies between fields within certain records
- systems with non-unique serial numbers
- systems invalid for analysis because they are used for internal testing
Because data lineage to the original raw files is preserved, it is easy to trace such errors back to their source and resolve them.
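A few of the checks listed above can be sketched in plain Python as follows. In Xcalar Design these checks are performed visually with profiling and map operations; this code only illustrates the underlying logic. The field names (`system_id`, `serial`, `date`, `sw_version`) and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def find_duplicate_records(rows, key_fields):
    """Return key tuples that occur more than once, with their counts."""
    counts = Counter(tuple(row.get(f) for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def find_shared_serials(rows):
    """Return serial numbers reported by more than one distinct system."""
    systems = defaultdict(set)
    for row in rows:
        systems[row["serial"]].add(row["system_id"])
    return {s: sorted(ids) for s, ids in systems.items() if len(ids) > 1}


def find_incomplete(rows, required):
    """Return rows in which any required field is missing or empty."""
    return [row for row in rows
            if any(row.get(f) in (None, "") for f in required)]


# Hypothetical system-information records.
rows = [
    {"system_id": "s1", "serial": "A1", "date": "2018-01-01", "sw_version": "9.1"},
    {"system_id": "s1", "serial": "A1", "date": "2018-01-01", "sw_version": "9.1"},
    {"system_id": "s2", "serial": "A1", "date": "2018-01-01", "sw_version": "9.3"},
    {"system_id": "s3", "serial": "C3", "date": "2018-01-01", "sw_version": ""},
]

duplicates = find_duplicate_records(rows, ["system_id", "date"])
shared = find_shared_serials(rows)
incomplete = find_incomplete(rows, ["serial", "sw_version"])
```

Here system `s1` is reported twice for the same date, serial `A1` is claimed by two systems, and `s3` has no software version.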
Figure 1 shows an example Profile graph. The graph shows the distribution of the number of bytes read from a large number of disks, using a log scale.
Many analyses of telemetry and log data involve examining how configurations, performance, and usage change over time. One example involves analyzing software version upgrades and downgrades. Simple aggregations of point-in-time data can show the distribution of software versions running on all systems. Aggregations done at multiple points in time can provide overall adoption rates for new software versions. But to understand how frequently and at what times specific systems were upgraded or downgraded, we must look at all the data over a longer time period. We apply windowing on this data to find all version transition points. Then, we join that information with other system characteristics to understand why administrators upgrade or downgrade certain systems at certain times. Furthermore, we can feed these transition points into an additional analysis that relates these events to other changes occurring on the system. These events might include performance and efficiency metrics or usage patterns immediately before and after the version transition. We can then train a machine learning model to predict what might happen on other systems when they go through that same software version transition.
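The transition-finding step can be pictured as a lag-style window: for each system, order its records by date and compare each software version with the previous one. The Python sketch below illustrates that idea only; it is not the Xcalar windowing extension. It assumes hypothetical field names, ISO-formatted date strings that sort chronologically, and purely numeric dotted version strings.

```python
from itertools import groupby
from operator import itemgetter


def version_key(version):
    """Turn '9.10.1' into (9, 10, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_transitions(rows):
    """Return one record per change of software version on a system."""
    # Sort by system, then date, so each group is in chronological order.
    ordered = sorted(rows, key=itemgetter("system_id", "date"))
    transitions = []
    for system_id, group in groupby(ordered, key=itemgetter("system_id")):
        previous = None
        for row in group:
            if previous is not None and row["sw_version"] != previous["sw_version"]:
                old, new = previous["sw_version"], row["sw_version"]
                kind = "upgrade" if version_key(new) > version_key(old) else "downgrade"
                transitions.append({"system_id": system_id, "date": row["date"],
                                    "from": old, "to": new, "kind": kind})
            previous = row
    return transitions


# Hypothetical daily records for two systems.
history = [
    {"system_id": "s1", "date": "2018-01-01", "sw_version": "9.1"},
    {"system_id": "s1", "date": "2018-01-02", "sw_version": "9.1"},
    {"system_id": "s1", "date": "2018-01-03", "sw_version": "9.10"},
    {"system_id": "s2", "date": "2018-01-02", "sw_version": "9.2"},
    {"system_id": "s2", "date": "2018-01-01", "sw_version": "9.3"},
]
transitions = find_transitions(history)
```

Comparing parsed version tuples rather than raw strings matters here: as strings, "9.10" would sort before "9.2", which would mislabel some upgrades as downgrades.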
Being able to predict system events or customer behavior has high business value. Two key examples in this case study are predicting system failures and predicting new business opportunities or risks.
One source of failure or unavailability is system misconfiguration. We can build a predictive model based on the combination of on-site configuration, system type, and applications for systems that have failed. This model can predict which systems in the field are most likely to fail. Customers or support personnel can then take action to prevent those failures.
A customer running heavily utilized systems may be ready to purchase additional equipment, particularly if the utilization trend is increasing. In contrast, a customer running systems that are increasingly idle may represent a business risk – perhaps they are migrating to a competitor's system. A predictive model of these trends across all systems provides a list of customers who should be engaged by their sales team.
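As a deliberately simple stand-in for such a predictive model, the sketch below fits a least-squares line to each customer's utilization history and labels the customer by the slope. The threshold, the label names, the customer names, and the utilization values are all hypothetical; a production model would draw on far more signals than a single trend line.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0  # a single observation carries no trend
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def classify_customers(utilization_by_customer, threshold=1.0):
    """Label each customer by the direction of its utilization trend."""
    labels = {}
    for customer, series in utilization_by_customer.items():
        slope = trend_slope(series)
        if slope >= threshold:
            labels[customer] = "opportunity"  # usage rising: may need capacity
        elif slope <= -threshold:
            labels[customer] = "risk"         # usage falling: may be migrating
        else:
            labels[customer] = "stable"
    return labels


# Hypothetical utilization percentages per reporting period.
utilization = {
    "acme": [60, 65, 70, 75, 80],
    "globex": [50, 45, 40, 30, 25],
    "initech": [40, 41, 40, 39, 40],
}
labels = classify_customers(utilization)
```

The customers labeled "opportunity" or "risk" would form the engagement list described above.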
A machine learning expert at this company uses Jupyter Notebook to develop and train algorithms using a representative sample from the data. Once the model is ready, Xcalar parallelizes the execution of the model on production data. The user pastes the TensorFlow Python code into an Xcalar UDF, makes the trained model file available, and Xcalar performs classification and scoring in parallel across all nodes and cores in the cluster.
Users create a batch dataflow from a model in Xcalar Design at any time. The batch dataflow is parameterized to point to different data source files, apply different filters, and output the result to a specific place.
Users can schedule batch dataflows. For example, a batch dataflow could be created to find all systems that have had their software version downgraded within the past month. The company can then contact the customers who performed those downgrades to understand why. This leads to improvements in the product as well as better customer relationships.
Key Benefits of Xcalar Technology
Data Exploration and Visualization: The visual analysis capability provided by Xcalar Design means that query language expertise is not required to develop complex dataflows.
Data Lineage: Dataflows show lineage back to original raw data bundles. This makes it easier to validate the correctness of transformation steps taken during analysis or directly detect problems in the raw data.
Simplicity: Users can perform all steps, from source data file import through data preparation, cleansing, and analytics, within a single visual tool.
Data Access: Xcalar can parse custom data source formats using small import UDFs. Users can modify, enhance, and tune import UDFs directly from Xcalar Design, then share them with other users. Import UDFs run in parallel across cores and nodes to saturate available NFS I/O bandwidth to the raw data.
Collaboration: Users can share datasets from within Xcalar Design, rather than having to re-derive tables using a query language or other tools.
Extensibility: The windowing extension in Xcalar Design is a powerful time-series analysis tool for analyzing performance and system utilization over time, and identifying transition points such as software upgrades and downgrades. Further extensions to Xcalar Design will provide push-button capability for common operations.
Once the model is developed, a batch dataflow can be run against data from various date ranges. For example, a user can build a dataflow model on one month's worth of data, then run a batch dataflow to process the data for the entire year.
Users can bring in new datasets at any time, and blend them with the dataflows already being modeled.
Data Exploration and Visualization: Users can explore the bundle data visually. They can look for relationships, correlations, and errors in the data, and correct them quickly.
Performance and Scalability: A scale-out architecture and minimal network data transfers enable real-time visual analysis on hundreds of millions of records. Import UDFs run in parallel across all cores and nodes.
Machine Learning: Tools such as Jupyter Notebook are best suited for initial model training and development. Once these models are developed, they can be applied iteratively to large datasets using Xcalar. Models can be deployed across all cores and nodes, with confidence scoring and retraining forming a continuous cycle of ML algorithm development.