Vertex AI is a powerful addition to the GCP suite of tools. It not only unifies many of GCP’s machine learning features in one place, but it also offers a dashboard for stepping through each of them and passing data between them. The design is clever: if you think about the steps you normally take to build a model from start to finish, the dashboard walks you through those same steps using buttons and drop-down controls. Less coding, less need for deep data science knowledge, and for many tasks, just buttons! And as with anything Google, you can keep it simple or make it as complex as your heart desires.
Select your dataset, select your features, choose a model (or let AutoML find the best one), train, evaluate, and so on. Though this may seem unremarkable on the surface and comparable to other platforms out there, Vertex AI quickly differentiates itself when you consider scale and deployment.
Vertex AI can ingest data and train models on practically unlimited data from sources such as BigQuery, then deploy those trained models in the same serverless GCP environment we’ve relied on for years, serving predictions via REST APIs. And all of this can be done with button clicks alone.
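To make the REST part concrete, here is a minimal sketch that only builds (and does not send) an online prediction request. It assumes the v1 `:predict` URL pattern and the `{"instances": [...]}` request body used by Vertex AI endpoints; the project ID, region, endpoint ID, and instance fields are hypothetical placeholders, and the exact shape of each instance depends on the deployed model.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def build_predict_request(
    project: str,
    location: str,
    endpoint_id: str,
    instances: List[Dict[str, Any]],
    parameters: Optional[Dict[str, Any]] = None,
) -> Tuple[str, str]:
    """Build (but do not send) a Vertex AI online prediction request.

    Returns the REST URL and the JSON request body. Actually sending it would
    also require an OAuth 2.0 access token in an "Authorization: Bearer ..." header.
    """
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/endpoints/{endpoint_id}:predict"
    )
    body: Dict[str, Any] = {"instances": instances}
    if parameters is not None:
        body["parameters"] = parameters  # optional, model-specific settings
    return url, json.dumps(body)


# Hypothetical values: replace with your own project, region, and endpoint ID.
url, body = build_predict_request(
    project="my-project",
    location="us-central1",
    endpoint_id="1234567890",
    instances=[{"age": 42, "income": 55000.0}],
)
print(url)
print(body)
```

A successful response contains a `predictions` list with one entry per instance. In practice, the `google-cloud-aiplatform` Python client wraps this call (for example, `Endpoint.predict`), so you rarely need to assemble the request by hand.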
It also comes with cutting-edge tools such as model monitoring and management (MLOps), model transparency (explainability), and notebook integration.
Let’s briefly look at some of them.
XAI – Vertex Explainable AI
Explainable AI is a set of tools and frameworks that help you understand and interpret the predictions made by your machine learning models, and it is integrated into Vertex AI. Below are images from the Google Cloud documentation showing which pixels in an image, and which features in tabular data, contributed most strongly to the resulting predictions.
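Vertex Explainable AI offers several feature attribution methods, including Sampled Shapley, Integrated Gradients, and XRAI. The sketch below is a from-scratch illustration of the idea behind Sampled Shapley, not Vertex AI’s implementation: it switches features from a baseline to their actual values in random orders and averages each feature’s marginal contribution to the output. The `toy_model` function and its weights are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def sampled_shapley(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
    num_paths: int = 200,
    seed: int = 0,
) -> List[float]:
    """Approximate Shapley-value attributions by averaging over random feature orderings."""
    rng = random.Random(seed)
    n = len(x)
    totals = [0.0] * n
    for _ in range(num_paths):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)  # start every path from the baseline input
        prev_output = model(current)
        for i in order:
            current[i] = x[i]  # switch feature i from its baseline to its actual value
            output = model(current)
            totals[i] += output - prev_output  # marginal contribution of feature i
            prev_output = output
    return [t / num_paths for t in totals]


def toy_model(features: Sequence[float]) -> float:
    # Hypothetical model with an interaction between features 0 and 1.
    f0, f1, f2 = features
    return 2.0 * f0 + 0.5 * f1 + f0 * f1 - 1.0 * f2


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = sampled_shapley(toy_model, x, baseline)
print(attributions)
```

Because every sampled path runs from the baseline all the way to `x`, the attributions always add up to `toy_model(x) - toy_model(baseline)` (here 2.0). Feature 2 has no interactions, so it always receives exactly -3.0, while the `f0 * f1` term is shared between features 0 and 1 depending on which one is switched on first.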
Hyperparameter Tuning
This feature works by running multiple trials of your training application, with your chosen hyperparameters set to values within the limits you specify. Vertex AI keeps track of each run and reports a summary of every trial along with the most effective configuration. To search for the optimal set of hyperparameters, Vertex AI’s hyperparameter tuning supports three commonly used algorithms: 1) Bayesian optimization (the default), 2) grid search, and 3) random search.
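The following local sketch contrasts two of those search strategies, grid search and random search, on a toy objective. It is not Vertex AI’s tuning service, which runs each trial as a separate training job; the hyperparameter names, ranges, and objective function are hypothetical. Bayesian optimization is left out because it needs a surrogate model of the objective, which would not fit in a short example.

```python
import itertools
import random
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]
Trial = Tuple[Params, float]


def grid_search(objective: Callable[[Params], float],
                grid: Dict[str, List[float]]) -> Tuple[Params, List[Trial]]:
    """Evaluate every combination of the listed values; return the best params and all trials."""
    names = list(grid)
    trials: List[Trial] = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        trials.append((params, objective(params)))
    best_params, _ = max(trials, key=lambda trial: trial[1])  # maximize the metric
    return best_params, trials


def random_search(objective: Callable[[Params], float],
                  bounds: Dict[str, Tuple[float, float]],
                  num_trials: int,
                  seed: int = 0) -> Tuple[Params, List[Trial]]:
    """Sample each hyperparameter uniformly within its bounds for a fixed number of trials."""
    rng = random.Random(seed)
    trials: List[Trial] = []
    for _ in range(num_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        trials.append((params, objective(params)))
    best_params, _ = max(trials, key=lambda trial: trial[1])
    return best_params, trials


def toy_objective(params: Params) -> float:
    # Hypothetical "validation accuracy" that peaks at learning_rate=0.1, dropout=0.3.
    return 1.0 - (params["learning_rate"] - 0.1) ** 2 - (params["dropout"] - 0.3) ** 2


best_grid, grid_trials = grid_search(
    toy_objective, {"learning_rate": [0.01, 0.1, 1.0], "dropout": [0.1, 0.3, 0.5]}
)
best_random, random_trials = random_search(
    toy_objective, {"learning_rate": (0.001, 1.0), "dropout": (0.0, 0.5)}, num_trials=20
)
print(best_grid, len(grid_trials))
print(best_random, len(random_trials))
```

Grid search cost grows multiplicatively with each added hyperparameter (here 3 x 3 = 9 trials), whereas random search lets you fix the trial budget up front, which is one reason it is often preferred when there are many hyperparameters.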
Endpoint Management
A common need is to deploy a model with different resources for different environments, such as testing and production. One might also want to use the same model in several applications whose performance requirements are quite different from each other. In such cases, one can deploy the model to a high-performance or a low-performance endpoint on Vertex AI. In fact, one model can be deployed to more than one endpoint to suit different performance requirements. Apart from that, Vertex AI endpoints also offer App Engine-like traffic management, so traffic between multiple models or versions deployed to an endpoint can be easily managed.
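An endpoint’s traffic split maps each deployed model’s ID to a percentage of requests, and the percentages must add up to 100. The sketch below illustrates that rule and one simple way to route a request by cumulative percentage ranges; how Vertex AI actually routes individual requests internally is not covered here, so the routing logic is an assumption for illustration, and the model IDs are hypothetical.

```python
from typing import Dict


def validate_split(traffic_split: Dict[str, int]) -> None:
    """Reject splits with negative percentages or percentages that do not total 100."""
    if any(percent < 0 for percent in traffic_split.values()):
        raise ValueError("traffic percentages must be non-negative")
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic percentages must add up to 100")


def route(traffic_split: Dict[str, int], u: float) -> str:
    """Map a uniform draw u in [0, 1) to a deployed model ID by cumulative percentage."""
    validate_split(traffic_split)
    if not 0.0 <= u < 1.0:
        raise ValueError("u must be in [0, 1)")
    cumulative = 0
    for model_id, percent in traffic_split.items():
        cumulative += percent
        if u * 100 < cumulative:
            return model_id
    raise RuntimeError("unreachable: percentages add up to 100")


# Hypothetical canary rollout: 90% of requests to the current model, 10% to a new one.
split = {"model-v1": 90, "model-v2": 10}
print(route(split, 0.42))  # falls in the first 90% of the range
print(route(split, 0.95))  # falls in the last 10% of the range
```

Rolling out a new version gradually is then just a matter of updating the percentages, for example from 90/10 to 50/50 and finally to 0/100.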
Integrated Kubeflow Pipelines
MLOps is the practice of applying DevOps concepts to machine learning systems. As the importance of MLOps in practical machine learning has become widely recognized, demand has grown for platforms and frameworks that make MLOps possible. Kubeflow is one such framework, in which machine learning workflows can be orchestrated effectively. Vertex AI’s predecessor, AI Platform, offered Kubeflow through AI Platform Pipelines, which required Kubeflow to be installed on a Kubernetes cluster. In Vertex AI, this process is simplified and upgraded: pipelines run without a cluster for you to manage, and Kubeflow features such as experiments and pipeline runs are available in Vertex AI itself. In addition, the upgraded Kubeflow Pipelines SDK (kfp>=1.6) supports Vertex AI Pipelines, along with new additions such as Google Cloud Pipeline Components. In short, Vertex AI simplifies the implementation of MLOps through its upgraded services.
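At its core, orchestration means running each step only after the steps it depends on have finished. The stdlib-only sketch below shows that idea with Python’s `graphlib` module (Python 3.9+); it is not the Kubeflow Pipelines SDK. With KFP, each step would be a component, and Vertex AI Pipelines would run each one as a containerized task on managed infrastructure. The step names and toy logic are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Set, Tuple


def run_pipeline(
    steps: Dict[str, Callable[[Dict[str, Any]], Any]],
    dependencies: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, Any]]:
    """Run each step after all of its upstream steps; return the order and every step's output."""
    outputs: Dict[str, Any] = {}
    order: List[str] = []
    for name in TopologicalSorter(dependencies).static_order():
        outputs[name] = steps[name](outputs)  # a step reads upstream results from `outputs`
        order.append(name)
    return order, outputs


# Hypothetical four-step workflow with toy logic standing in for real components.
steps = {
    "ingest": lambda out: [1.0, 2.0, 3.0],
    "train": lambda out: sum(out["ingest"]) / len(out["ingest"]),  # "model" = the mean
    "evaluate": lambda out: abs(out["train"] - 2.0),  # error against a known target
    "deploy": lambda out: out["evaluate"] < 0.5,  # deploy only if the error is small
}
dependencies = {
    "ingest": set(),
    "train": {"ingest"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}
order, outputs = run_pipeline(steps, dependencies)
print(order)
print(outputs["deploy"])
```

A real orchestrator adds what this sketch leaves out, such as running independent steps in parallel, caching results, retrying failures, and recording metadata for each run.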
Advantages of services offered by Vertex AI
The following table highlights some of the services offered by Vertex AI, the problems they try to solve, and how they make things easier for end users.
ML Services | Before Vertex AI | With Vertex AI |
---|---|---|
Datasets | Data had to be loaded from its sources manually, either in chunks or as one entire batch. | The managed datasets service handles data more smoothly, including splitting it for training applications (both AutoML and custom training). Dataset statistics can be checked at any time. |
Features | Feature-engineering processes can be hard to share and reuse across applications or projects, which can lead to duplicated effort and introduce training/serving skew. | Vertex AI Feature Store enables storing, sharing, and serving machine learning features at scale. Feature values can be fetched for training as well as served with low latency for online prediction. |
Labeling | Manually labeling the image/video/text content in the datasets can be hard to manage and time-consuming. | Labeling tasks use human labelers to annotate your data items (image/video/text) at scale. Well-labeled content results in better training data, which leads to more accurate model predictions. |
Notebooks | Running notebooks on local machines can make it hard to manage environments across different applications, and local compute resources might not be sufficient in some cases. | Vertex AI Notebooks come with JupyterLab pre-installed and configured with GPU-enabled machine learning frameworks. They also make it easy to access other GCP services. |
Pipelines | Orchestrating ML workflows typically involves configuring clusters and machine resources, writing DAGs/pipelines, and building large applications that manage the training, testing, and deployment of ML models. | Vertex AI Pipelines helps you automate, monitor, and govern your machine learning systems by orchestrating your workflow in a serverless manner. |
Training | Training the ML model manually can consume a lot of time and compute resources. | Training pipelines are the primary model training workflow in Vertex AI. You can use training pipelines to create an AutoML-trained model or a custom-trained model. For custom-trained models, training pipelines orchestrate custom training jobs and hyperparameter tuning with additional steps like adding a dataset or uploading the model to Vertex AI for prediction serving. |
Models | Managing and deploying models manually can involve writing an application or framework to load the model and serve inferences. The application might also need to handle pre- and post-processing steps as well as incoming traffic. | Vertex AI’s model resources help you manage models on GCP, including deploying them and generating predictions. Vertex AI Models can handle both AutoML models and custom-trained models. |
Endpoints | Manually deploying models and generating inferences from them can involve building web applications or frameworks to serve the model to end users. This process can be tedious, especially when managing traffic, updating model versions, and monitoring requests. | Endpoints serve deployed machine learning models for online prediction requests, which makes them useful when many users need timely predictions. If you don’t need immediate results, you can request batch predictions instead. Endpoints support versioning and traffic splitting, enabling smooth transitions between versions and A/B testing. |
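The training/serving skew mentioned in the Features row is one of the problems a feature store helps prevent: training examples should use feature values as they were when each example was labeled, while online prediction uses the latest values, and both should come from the same source. The toy, in-memory class below sketches that idea with point-in-time reads; it is not the Vertex AI Feature Store API, and the entity and feature names are hypothetical.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class ToyFeatureStore:
    """In-memory store of timestamped feature values, keyed by (entity ID, feature name)."""

    def __init__(self) -> None:
        self._series: Dict[Tuple[str, str], List[Tuple[int, float]]] = {}

    def write(self, entity_id: str, feature: str, timestamp: int, value: float) -> None:
        series = self._series.setdefault((entity_id, feature), [])
        bisect.insort(series, (timestamp, value))  # keep each series sorted by timestamp

    def read_as_of(self, entity_id: str, feature: str, timestamp: int) -> Optional[float]:
        """Point-in-time read for training: the latest value written at or before `timestamp`."""
        series = self._series.get((entity_id, feature), [])
        idx = bisect.bisect_right(series, (timestamp, float("inf")))
        return series[idx - 1][1] if idx > 0 else None

    def read_latest(self, entity_id: str, feature: str) -> Optional[float]:
        """Online-serving read: the most recent value."""
        series = self._series.get((entity_id, feature), [])
        return series[-1][1] if series else None


store = ToyFeatureStore()
store.write("user_1", "avg_spend", timestamp=100, value=10.0)
store.write("user_1", "avg_spend", timestamp=200, value=20.0)

# A training example labeled at t=150 must see the value as of t=150, not the latest one.
print(store.read_as_of("user_1", "avg_spend", 150))
print(store.read_latest("user_1", "avg_spend"))
```

Reading the latest value for every historical training example would leak information from the future into training, so the model would see different inputs in training than it does in production. Serving both reads from one store keeps the feature definitions identical.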
Vertex AI’s services are not limited to the items listed above; there are many other features, such as Vizier, Metadata, TensorBoard instances, and managed instances. Although some of them are still available only as preview services, they further demonstrate how thoroughly Vertex AI is equipped for handling datasets and for training, testing, deploying, and operationalizing machine learning models in a cloud environment.
Vertex AI is a great step forward by Google Cloud for novice and expert data scientists alike. The platform does an outstanding job of taking the heavy lifting of ML lifecycle activities off users’ plates, allowing them to focus on the business problem at hand. The platform is unified in every sense of the word, and we can expect it to keep evolving from here, with more productivity-boosting features to come.