Automatic Machine Learning at Scale: H2O AutoML
For supervised machine learning tasks, AutoML can significantly speed up your time to deployment and tighten the feedback loop for feature experimentation. To make sure the tools perform reasonably well, we test H2O AutoML against a popular benchmark of commonly used machine learning algorithms using both a single virtual machine and a cluster of four such machines running Apache Spark on AWS Elastic MapReduce (EMR). Once you know how to use AutoML, you might want to think about where to leverage it so as to provide the most value. At the end of the post, we'll give you some insight into how we think about these tools at IKASI. For now, here's how to get up and running.
Automatic Machine Learning (AutoML) tools allow you to benefit from the predictive power of stacked model ensembles with just a table of data and the name of the column you want to predict. With H2O.ai Sparkling Water, these tools have reached the point where they can handle building one-off models on any amount of data. You can use AutoML to:
- Start with a good model as a baseline
- See how different kinds of models perform on your dataset
- Avoid common pitfalls of machine learning
How to: Cluster
Create an EMR cluster using the AWS Management Console so that the default roles are created; once that's done, you can terminate it. Then add SSH access for your IP address to the inbound rules of the EMR master security group in AWS.
This script will create a cluster, or add steps to an existing cluster. Edit `run_test_emr.py` before running it to set your preferences and AWS account-specific information at the top of the file. It adds three steps for each modeling job, although the first step could be run as a bootstrap action instead once the modeling code is stable.
- Copy the modeling code (`model_script.py`) from S3 to the EMR cluster driver
- Use `spark-submit` to run the main step
  - a. Use `spark.yarn.maxAppAttempts=1` to disable retries
  - b. Use 'cluster' deploy mode and 'yarn' master
  - c. `model_script.py` is written in Python 2
- Copy the model results from the cluster's Hadoop filesystem to S3
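The three steps above can be sketched as EMR step definitions suitable for boto3's `add_job_flow_steps` call. The bucket name and paths here are illustrative, not the post's actual locations:

```python
# EMR step definitions mirroring the three steps above.
# BUCKET and the destination paths are hypothetical placeholders.
BUCKET = "s3://my-bucket"

steps = [
    {   # 1. Copy the modeling code from S3 to the driver
        "Name": "copy-model-script",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["aws", "s3", "cp",
                     f"{BUCKET}/model_script.py", "/home/hadoop/"],
        },
    },
    {   # 2. Run the modeling job via spark-submit
        "Name": "run-automl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit",
                     "--master", "yarn",
                     "--deploy-mode", "cluster",
                     "--conf", "spark.yarn.maxAppAttempts=1",  # disable retries
                     "/home/hadoop/model_script.py"],
        },
    },
    {   # 3. Copy the results from HDFS back to S3
        "Name": "copy-results",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", "hdfs:///results",
                     "--dest", f"{BUCKET}/results"],
        },
    },
]
```

With boto3, these could be submitted to a running cluster via `client.add_job_flow_steps(JobFlowId=<cluster-id>, Steps=steps)`.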
We trained H2O AutoML on a single virtual machine (r4.8xlarge) using ten million rows of data from the szilard benchmark. The data consists of eight input columns (six categorical, two numeric) and one output column (dep_delayed_15min). The categorical columns are of low cardinality, and there is no missing data. With a requested stopping time of 6,000 seconds (one hour and forty minutes), H2O AutoML achieved an average Area Under the Curve (AUC) of 0.7985 over three runs with its best-of-family ensemble. For comparison, on the benchmark the best linear model achieved an AUC of 0.711; the best Random Forest (RF), 0.778; and the best Gradient Boosting Machine (GBM), 0.787, in a similar amount of time.
On a cluster of four machines in Amazon Elastic MapReduce (EMR), the Sparkling Water version managed an AUC of 0.7801, also averaged over three runs. When given a 12,000-second budget (three hours and twenty minutes), Sparkling Water achieved an AUC of 0.809 in a single run. The best result in the benchmark was an AUC of 0.812, achieved by training a GBM for more than nine hours. This is the power of stacked ensembles: models that combine the predictions of disparate machine learning models to do better than any one model by itself.
If your data has fewer than one hundred million rows, we'd recommend using single-machine H2O AutoML on a memory-optimized EC2 instance. Otherwise, or if you have existing Spark infrastructure, the Sparkling Water version still performs quite well, and there are other options available to those with more machine learning expertise. When using Spark, you may need to specify the data types manually; for the purposes of the benchmark, we had to convert string columns to enums.
Single Machine (r4.8xlarge) – 32 virtual CPUs, 244 GiB memory, $2.128 per hour
- max_runtime_secs: 6000 (1 hr 40 min)
- actual runtime: 6946.8 s, 6859.2 s, 6868.7 s
- leaderboards: Leaderboard-1, Leaderboard-2, Leaderboard-3
- Best of Family AUC: 0.800351, 0.796376, 0.798908
- Best of Family AUC sample mean: 0.7985, sample variance: 4.049e-06
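The reported mean and variance are the plain sample statistics over the three single-machine AUCs, which you can reproduce directly:

```python
# Reproduce the summary statistics for the single-machine runs.
from statistics import mean, variance

aucs = [0.800351, 0.796376, 0.798908]  # Best of Family AUCs, three runs

print(round(mean(aucs), 4))   # 0.7985
print(f"{variance(aucs):.3e}")  # sample (n-1) variance: 4.049e-06
```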
Cluster – four r4.8xlarge workers, one m4.4xlarge master
- max_runtime_secs: 6000 (1 hr 40 min)
- actual runtime: 6362.7 s, 6316.5 s, 6439.1 s
- step runtime: 1 hr 51 min, 1 hr 50 min, 1 hr 52 min
- leaderboards: Leaderboard-1, Leaderboard-2, Leaderboard-3
- Best of Family AUC: 0.77937, 0.780583, 0.780254
- Best of Family AUC sample mean: 0.7801, sample variance: 3.935e-07
- max_runtime_secs: 12000 (3 hr 20 min)
- actual runtime: 12646.8 s
- step runtime: 3 hr 36 min
- Best of Family AUC: 0.808793
Without access to the test set, we would have chosen `StackedEnsemble_AllModels`, but such ensembles sometimes do slightly worse on the test set than a particular GBM (Gradient Boosting Machine) or DRF (Distributed Random Forest). See this video by H2O for more on their implementations of GBMs and DRFs. The best-of-family ensemble includes only the best model of each kind in the final ensemble. Extremely Randomized Trees (XRT), Neural Networks (DeepLearning), and Generalized Linear Models (GLM) are further examples of the kinds of model H2O AutoML generates.
Until now, various implementations of the Random Forest algorithm have been the most user-friendly machine learning options. With the release of H2O AutoML, this is no longer the case. From an engineering standpoint, it is a huge advantage to be able to provide a training set, a test set (optional), and a column to predict, with no need to set any parameter other than how long you want the job to take. In addition, the output is considerably more helpful as a baseline than a linear model or any single model alone. After feature engineering, machine learning is in practice about finding the best algorithm for the job. You need to test a panel of algorithms after each change to the input data, and H2O AutoML makes that possible at any scale.
At IKASI, we use AutoML as a sanity check, not a cure-all. It can't find features for you, but it can tell you which ones are useful to your model. Don't ignore the human context: define your objectives around your business, not just metrics like AUC. Finally, AutoML is a great way to avoid the pitfalls of machine learning. Data leakage, for example, often reveals itself as a suspiciously perfect score on the test set. As usual, if it looks too good to be true, it probably is.
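To see what leakage looks like in practice, here is a small synthetic demonstration (all data is simulated): adding a feature secretly derived from the label drives the test AUC to essentially 1.0, while the honest features give only a modest score.

```python
# Synthetic demonstration of target leakage.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X_honest = rng.normal(size=(n, 5))
# Label depends noisily on the first feature only.
y = (X_honest[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(int)
# A "feature" that is secretly the label plus tiny noise.
leak = y + rng.normal(scale=0.01, size=n)
X_leaky = np.column_stack([X_honest, leak])

results = {}
for name, X in [("honest", X_honest), ("leaky", X_leaky)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    results[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(name, round(results[name], 3))
```

The "leaky" run scores near 1.0; a real-world model that does the same deserves an audit of its input columns before a celebration.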
- How to provision a VM in AWS EC2
- emr-spec.json.template for EMR cluster creation
- automl_run_test_emr.py script for EMR cluster creation and step submission
- model.py script for Sparkling Water
Create a key pair (on a Mac, using `ssh-keygen -t rsa`), perhaps choosing `~/.ssh/emr_test_rsa` for your private key location and `~/.ssh/emr_test_rsa.pub` for your public key location. Then use `aws ec2 import-key-pair --key-name emr_test_rsa --public-key-material file://~/.ssh/emr_test_rsa.pub` to import your key pair.
Create a launch template, replacing "subnet-1234a56b" with your own subnet.
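A minimal sketch of such a command using the AWS CLI follows; the instance type matches the r4.8xlarge used above, but the template name, AMI ID, and key name are placeholders you must replace with your own values:

```shell
# Hypothetical launch-template creation; replace ami-xxxxxxxx,
# subnet-1234a56b, and the key name with your own values.
aws ec2 create-launch-template \
  --launch-template-name emr-test-template \
  --launch-template-data '{
    "ImageId": "ami-xxxxxxxx",
    "InstanceType": "r4.8xlarge",
    "KeyName": "emr_test_rsa",
    "NetworkInterfaces": [{
      "DeviceIndex": 0,
      "SubnetId": "subnet-1234a56b",
      "AssociatePublicIpAddress": true
    }]
  }'
```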
Create an EC2 instance in AWS by choosing “Launch instance from template” from the console.
Choose the first version of the template, then scroll down and launch the instance. Once it launches, select it in the console to find its public IP address, click the default security group to add your IP address to the inbound rules, and then use the public IP to SSH to your VM and install H2O.
Next, SSH to the VM using `ssh -i "~/.ssh/emr_test_rsa" ec2-user@<your-public-ip-address>`, replacing the IP address with the one for your VM.