Data scientists cannot possibly explore every modeling alternative relevant to their latest machine-learning project. That’s why we hear a growing drumbeat of demands for what’s now being referred to as “automated machine learning.” For example, it was a central theme of the keynote by Stanford’s Christopher Ré at Spark Summit this morning, as well as of Intel’s Michael Greene.
Essentially, this is an emerging practice in which data scientists use machine learning tools to accelerate the process of developing, evaluating, and refining machine learning models. Here is a quick list of the core machine-learning modeling tasks that are amenable to automation:
- Automated data visualization: This accelerates the front-end process of exploring data prior to modeling, such as by automatically plotting all variables against the target variable being predicted through machine learning, and also computing summary statistics.
- Automated data preprocessing: This accelerates the process of readying training data for modeling by automating how categorical variables are encoded, missing values are imputed, and so on.
- Automated feature engineering: Once the modeling process has begun, this accelerates the process of exploring alternative feature representations of the data, such as by testing which best fit the training data.
- Automated algorithm selection: Once feature engineering has been performed, this accelerates the process of identifying, from diverse algorithmic and neural-net architectural options, the algorithms best suited to the data and the learning challenge at hand.
- Automated hyperparameter tuning: Once the algorithm has been selected, this accelerates the process of identifying optimal model hyperparameters, such as the number of hidden layers, the model’s learning rate (the size of the adjustments made to backpropagated weights at each iteration), its regularization (adjustments that help models avoid overfitting), and so on.
- Automated model benchmarking: Once the training data, feature representation, algorithm, and hyperparameters are set, this accelerates the process, prior to deployment, of generating, evaluating, and identifying trade-offs among alternate candidate models that conform with all of that.
- Automated model diagnostics: Once models have been deployed, this accelerates the process of generating the learning curves, partial dependence plots, and other metrics that show how rapidly, accurately, and efficiently they achieved their target outcomes, such as making predictions or classifications on live data.
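Several of these tasks boil down to searching a space of candidate options and scoring each one. As a minimal, purely illustrative sketch of automated hyperparameter tuning, here is an exhaustive grid search over a hypothetical search space; the parameter names and the `score` function are stand-ins for what would really be a cross-validated evaluation of a trained model:

```python
import itertools

# Hypothetical hyperparameter space; a real run would cover far more options.
SEARCH_SPACE = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_layers": [1, 2, 3],
    "regularization": [0.0, 0.0001, 0.001],
}

def score(params):
    # Stand-in for a validation-set evaluation of a trained model; here we
    # simply prefer a moderate learning rate and lighter regularization.
    return -abs(params["learning_rate"] - 0.01) - params["regularization"]

def grid_search(space, score_fn):
    """Try every combination in the space and keep the best-scoring one."""
    keys = list(space)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(space[k] for k in keys)):
        params = dict(zip(keys, values))
        s = score_fn(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score
```

Real tuning tools swap exhaustive enumeration for smarter strategies (random, Bayesian, or evolutionary search) once the space grows too large to enumerate, but the propose-score-keep-the-best loop is the same.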
Some tools are starting to emerge from labs and the open-source community to handle some of these automation tasks. Most notably, Google recently announced its own initiative, AutoML, which, the vendor claims, “will be able to design and optimize a machine learning pipeline faster than 99 percent of the humans out there.” Fundamentally, Google’s approach:
- Leverages several algorithmic approaches: AutoML relies on Bayesian, evolutionary, regression, meta-learning, transfer learning, combinatorial optimization, and reinforcement learning algorithms.
- Hubs modeling automation around a controller node: AutoML uses a “controller” neural net to propose an initial “child” neural-net architecture that it trains on a specific validation data set.
- Iteratively refines machine learning architectures: AutoML develops, trains, and refines multilayered machine-learning model architectures in repeated iterative rounds. It may take hundreds or thousands of iterated rounds for the controller to learn which aspects of the machine-learning architectural space achieve high vs. low accuracy with respect to the assigned learning task on the training set.
- Guides model refinement through iterative feedback loops: AutoML’s controller neural net acquires feedback from the performance of the child model, with respect to a particular learning task, for guidance in generating a new child neural net to test in the next round.
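The controller-and-child feedback loop described above can be caricatured in a few lines. This is a deliberately toy sketch: the `fitness` function, the architecture encoding (a layer count and a layer width), and the greedy mutate-and-keep-improvements rule are all illustrative assumptions standing in for training a real child network and feeding its validation accuracy back to a controller:

```python
# Toy stand-in for an architecture-search feedback loop: propose child
# architectures near the current best, score them, and keep improvements.

def fitness(arch):
    # Hypothetical scoring rule: pretend the task is best served by a
    # 4-layer network of width 64. A real system would train the child
    # network and return its validation accuracy instead.
    layers, width = arch
    return -abs(layers - 4) - abs(width - 64) / 16

def search(start=(1, 16), rounds=20):
    best, best_fit = start, fitness(start)
    for _ in range(rounds):
        # "Controller" step: propose neighboring child architectures.
        children = [(max(1, best[0] + dl), max(8, best[1] + dw))
                    for dl in (-1, 0, 1) for dw in (-16, 0, 16)]
        # Feedback step: evaluate each child and keep the best improvement.
        child = max(children, key=fitness)
        if fitness(child) <= best_fit:
            break  # no neighbor improves on the current architecture
        best, best_fit = child, fitness(child)
    return best
```

Google’s actual controller is a neural net trained with reinforcement learning over thousands of rounds; the greedy loop here only illustrates the propose-evaluate-feedback cycle that drives it.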
As this technology matures and gets commercialized, some fear that it may automate data scientists out of jobs. I think any such concerns are overblown. That’s because every one of the machine-learning pipeline automation scenarios these tools support requires a data scientist to set it up, monitor how it proceeds, and evaluate the results.
Manual quality assurance will always remain a core task for which human data scientists are responsible, no matter how much the rest of their jobs gets automated.