Optimize Utility Maintenance Prediction with Intel® AI Analytics Toolkit

Using the open source Predictive Asset Maintenance Reference Kit (built as a collaboration between Intel and Accenture*), this video demonstrates how to optimize the training cycles, prediction throughput, and accuracy of your machine learning workflow with the Intel® AI Analytics Toolkit and the Intel® oneAPI Data Analytics Library (oneDAL).

AI Reference Kit

GitHub*

Hello, my name is Kelli and today I am going to be showing you how to optimize your machine learning workflow using the Intel® AI Analytics Toolkit (AI Kit) powered by oneAPI.

The dataset in this tutorial was generated using the Predictive Asset Maintenance Reference Kit built by Intel and consists of 100,000 different utility poles with over 30 features on the overall health of the utility. Our target variable is a binary indicator, representing whether or not the utility pole requires maintenance. You can get the code for this open source reference kit and find out more about it by clicking on this link, which will take you to the GitHub* repo.

The main libraries that we'll be working with in this guide are the Intel® Distribution of Modin*, Intel® Extension for Scikit-learn*, XGBoost, and Daal4py, all of which can be downloaded as part of the AI Analytics Toolkit or as stand-alone libraries.

The underlying hardware that we will be using is a 3rd generation Intel® Xeon® Platinum 8375C processor, which is an Ice Lake CPU, on an AWS* M6i.4xlarge instance.

To get started, we’ll be importing our libraries and dataset, and we’ll be using the Intel Distribution of Modin to process and explore the data. Modin is a distributed DataFrame library designed to scale your pandas workflow with the size of your dataset, supporting datasets ranging from 1 MB to 1 TB+. With pandas, only one core is used at a time; with Modin, all of the available cores are used, which allows you to work with very large datasets at much faster speeds. Modin supports a few different execution engines, and in this demo we’ll be using Dask. To use Modin with the Dask engine, import modin.pandas as pd, just as you would pandas, and then initialize the Dask execution environment that Modin will use for distributed computing, shown here in this code cell.
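The setup above can be sketched as follows. This is a minimal, hedged example: the column names are hypothetical stand-ins for the pole dataset, and the try/except fallback to stock pandas keeps the sketch runnable in environments where Modin is not installed.

```python
import os

# Select the Dask execution engine before Modin is imported.
os.environ.setdefault("MODIN_ENGINE", "dask")

try:
    import modin.pandas as pd  # drop-in replacement for `import pandas as pd`
except ImportError:
    import pandas as pd  # fallback: identical API, single-core execution

# From here on, the familiar pandas API is used unchanged; with Modin,
# operations are distributed across all available cores.
df = pd.DataFrame({"pole_id": [1, 2, 3], "age_years": [12, 47, 3]})
print(df.shape)  # (3, 2)
```

Because Modin mirrors the pandas API, the rest of the notebook’s DataFrame code needs no changes after this one import swap.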

To begin exploring our data, I am first checking for any missing values in the columns as well as for any duplicate records, but as we can see, the dataset is fairly clean with no missing or duplicate entries. I’m also checking the descriptive statistics of the data, which provide a numerical summary of the central tendency, dispersion, and shape of the data's distribution, and I’ve added skewness and kurtosis to the summary.
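These data-quality checks can be reproduced with a few lines of pandas. A small stand-in DataFrame is used here in place of the 100,000-row utility-pole dataset, and the column names are illustrative assumptions:

```python
import pandas as pd

# Stand-in for the utility-pole dataset; column names are hypothetical.
df = pd.DataFrame({
    "age_years": [12, 47, 3, 61, 28],
    "elevation_ft": [850, 1200, 400, 990, 1500],
})

missing_per_column = df.isna().sum()    # count of missing values per column
duplicate_rows = df.duplicated().sum()  # count of fully duplicated records

# Descriptive statistics, extended with skewness and kurtosis as in the demo.
summary = df.describe().T
summary["skew"] = df.skew()
summary["kurtosis"] = df.kurtosis()
print(summary[["mean", "std", "skew", "kurtosis"]])
```

Appending the skew and kurtosis columns to the transposed `describe()` output keeps the whole numerical summary in one table.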

Now, let’s take a look at the distribution of our target variable, which is Asset Label. About 40% of poles in our data have been identified as needing maintenance, shown in the yellow portion of the chart. Given the slight imbalance in the distribution of our target variable, we may want to use stratified sampling during cross-validation.

For the rest of the EDA, I’ve separated the features into sections based on their data types. These graphs show the individual kernel density plots of the numerical features, colored by the target label, so that we can see the underlying distribution of each variable, and the next set of plots shows the relationships between the numerical features and the target variable. For Age, poles older than about 45 years or younger than about five years tend to have a higher probability of needing maintenance, while for Elevation the same is true for poles below about 1,000 feet. In addition, we see there are non-linear relationships between the target and the features, which suggests we may want to try a nonparametric model. This is further demonstrated in the scatterplot matrix below, where we see very weak linearity and low correlations between the numerical features in the data.

Now that we have explored our numerical variables, let's take a look at the distributions of the categorical features. The categorical features in the dataset have already been preprocessed using one-hot encoding, which represents the presence or absence of each category with a corresponding 1 or 0, respectively. These graphs show the frequency of occurrence of each feature, colored by the target label.
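For reference, this is what that preprocessing step looks like with pandas. The reference kit’s data arrives already encoded; the `material` column here is a hypothetical example:

```python
import pandas as pd

# Hypothetical raw categorical column for a utility pole.
poles = pd.DataFrame({"material": ["wood", "steel", "wood", "concrete"]})

# One-hot encoding: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(poles, columns=["material"], dtype=int)
print(list(encoded.columns))
# ['material_concrete', 'material_steel', 'material_wood']
```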

Lastly, this correlation matrix shows the strength of the linear associations between each of the variables, which are fairly low overall. The highest correlation in the dataset is a 0.31 between our target variable and whether the pole was treated.

Now that we’ve done a little bit of data exploration, we’ll begin training different machine learning models to predict whether or not a pole will need maintenance. With binary classification tasks, there are many different models to choose from. Some of the most common models include Logistic Regression, Naïve Bayes, K-Nearest Neighbors, Support Vector Machines, and ensemble methods, like Random Forests and XGBoost. Since we saw there are some nonlinear relationships between the features and the target variable in the graphs above, we will compare two types of nonparametric models: Support Vector Machines and XGBoost.

The first model I will fit is a Support Vector Classifier (SVC). To reduce the algorithm's run time, I will be using the accelerations provided by the Intel Extension for Scikit-learn. The extension requires no changes to your code: call the patch_sklearn() function as shown in the code cell below, then import your scikit-learn libraries, and the patch will replace supported stock scikit-learn algorithms with their optimized versions. To learn more about this extension, please visit our developer guide, which is linked here in the notebook.
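The patching step looks like this. Note the ordering requirement: patch_sklearn() must run before the scikit-learn estimators are imported. The try/except is an addition for environments where the extension is not installed and is not part of the kit’s own code:

```python
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # swap in Intel-optimized versions of supported algorithms
except ImportError:
    pass  # extension not installed; fall back to stock scikit-learn

# Imports AFTER patching resolve to the optimized implementations (if patched).
from sklearn.svm import SVC

clf = SVC(class_weight="balanced", probability=True)
```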

Once the scikit-learn extension has been enabled, we will import the scikit-learn libraries that we’ll need. Next, we’ll scale and split our data into training and test sets using the prepare_train_test function and initialize the support vector machine with a class weight of balanced and probability set to true. Then, we will use stratified k-fold cross-validation with three splits to tune the hyperparameters. Here, we grid search over several different kernels and regularization parameters (C), using the area under the ROC curve to score the parameter selection, with the number of jobs set to -1 to parallelize the training. Finally, we store the best model found and perform inference on the out-of-sample test set.
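The tuning loop described above can be condensed into the following sketch. Synthetic data stands in for the pole dataset, and the parameter grid is a small illustrative one; the kit’s own prepare_train_test helper and full grid live in the repo:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for the pole data (~60/40 split).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.6, 0.4],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)  # fit the scaler on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

param_grid = {"kernel": ["rbf", "poly"],            # illustrative grid
              "C": np.logspace(-0.5, 0.5, 3)}
search = GridSearchCV(
    SVC(class_weight="balanced", probability=True),
    param_grid,
    scoring="roc_auc",                              # optimize area under the ROC curve
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    n_jobs=-1,                                      # parallelize across all cores
)
search.fit(X_train, y_train)

best_model = search.best_estimator_                 # kept for out-of-sample inference
test_probs = best_model.predict_proba(X_test)[:, 1]
```

Stratified folds preserve the roughly 60/40 class ratio in every split, which keeps the per-fold AUC estimates comparable.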

Here we see the best hyperparameters found were the Polynomial kernel with a regularization parameter of 3.16. And in the graphs below, I’ve plotted the ROC and precision-recall curves. On the test set, the tuned SVC model gave us an area under the curve of 0.91 and an average precision score (or AUPRC) of 0.887.

To see if we can further improve on the results of the SVC, we’ll now compare the performance of an XGBoost model. Since XGBoost version 0.81, Intel has introduced many optimizations to maximize training performance. There is no need to import any additional libraries; the optimizations have already been upstreamed into the package.

I’ll first be initializing the XGBoost model with scale_pos_weight equal to the proportion of positive samples in the dataset and setting the tree method to hist, a faster histogram-optimized algorithm recommended for higher performance on large datasets.

Following the same steps as above, we will be using stratified three-fold cross-validation to tune the hyperparameters. The best hyperparameters found were a maximum tree depth of five, a minimum of five samples in each node, and a gamma of 0.5, which represents the cost complexity of adding an additional leaf to the tree and can help prevent overfitting in shallower trees. The best AUC on the cross-validation set was about 0.938.

For further improved prediction performance, I’ll be converting the tuned XGBoost model into a daal4py model. Daal4py is the Python* API of Intel’s oneAPI Data Analytics Library (oneDAL) and uses the underlying AVX-512 instructions to maximize gradient boosting performance on Intel® Xeon® processors.

Converting a tuned XGBoost model to daal4py takes just one line of code, using the get_gbt_model_from_xgboost function. Then, you can pass the converted model and the input data to daal4py’s prediction function to calculate the probabilities on the test set. And here are the results of the XGBoost model's performance: XGBoost produced both a higher AUC and a higher average precision score than the SVC.

That concludes this demo using the Predictive Asset Maintenance Reference Kit. Thank you for watching, and if you would like to download and use this reference kit, you can click on this link in the notebook and it will take you to the GitHub repo.