Please read MLOps: Continuous delivery and automation pipelines in machine learning before beginning this tutorial.
To summarize the MLOps levels:
- MLOps level 0: Manual process
- MLOps level 1: ML pipeline automation
- MLOps level 2: CI/CD pipeline automation
The goal of MLOps level 2 is to achieve the same velocity and quality as DevOps teams: new web applications and features are created, tested, deployed, and destroyed every day, if not multiple times a day, all with zero impact on users. For example, Google constantly adds or updates products across its entire portfolio, which means there are hundreds to thousands of new deployments on any given day. Even if your company is not as big as Google, you and your AI/ML team can and should aspire to the same velocity as the application development teams. That means adopting the practices and principles of DevOps.
- Jupyter Notebooks
- I load the data in a Jupyter Notebook
- I iterate on the model
- I run the training notebook to output a model
- Non-Jupyter IDE: Code is written in .py files to accommodate containers, not .ipynb
- Docker Containers: Run custom code repeatably
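A training container can be as small as a base image plus the code. This Dockerfile is a hedged sketch; the base image, train.py, and requirements.txt names are assumptions, not files provided by this tutorial:

```dockerfile
# Hypothetical minimal training image; file names and base image are assumptions.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY train.py .
ENTRYPOINT ["python", "train.py"]
```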
- Pipelines
- I may iterate on the training container
- Manually build the Docker image and push it to Artifact Registry
- I may modify the Vertex Pipeline
- Manually recompile pipeline.yaml
- I may need a new bucket
- Manually create a new bucket in the Console
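The manual steps above roughly correspond to commands like the following; the project ID, region, repository, bucket, and module names are placeholders, not values from this tutorial:

```shell
# Build the training image and push it to Artifact Registry (names are placeholders).
docker build -t us-central1-docker.pkg.dev/MY_PROJECT/my-repo/trainer:latest .
docker push us-central1-docker.pkg.dev/MY_PROJECT/my-repo/trainer:latest

# Recompile the Vertex AI pipeline definition with the KFP SDK
# (assumes a pipeline.py module exposing a pipeline function).
python -c "from kfp import compiler; import pipeline; \
compiler.Compiler().compile(pipeline.pipeline, 'pipeline.yaml')"

# Create a new bucket from the CLI instead of the Console.
gcloud storage buckets create gs://MY_PROJECT-ml-artifacts --location=us-central1
```

Doing these by hand works, but every step is a chance for drift between environments, which is what the automation below removes.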
- Unit testing
- Production environments: Production environments should not be touched manually; only vetted code can deploy
- Terraform: Infrastructure as Code, e.g. create new buckets using code instead of the Console
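For example, the "new bucket" step becomes a few lines of Terraform instead of Console clicks. This is a sketch; the resource name, bucket name, and region are placeholders:

```terraform
# Hypothetical bucket definition; name and location are placeholders.
resource "google_storage_bucket" "ml_artifacts" {
  name                        = "my-project-ml-artifacts"
  location                    = "US-CENTRAL1"
  uniform_bucket_level_access = true
}
```

Because the bucket is now code, it is reviewed, versioned, and applied by the build service rather than created by hand.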
- Cloud Build: Build service e.g. docker build and push
- Git automation: When source code changes, changes are automatically tested and applied
- I iterate on training container locally
- I git push the changes to the dev branch in the repo of choice, e.g. GitLab or GitHub
- Cloud Build detects the change, then runs the steps in cloudbuild.yaml (*), which may include:
- Running unit tests
- Running Docker build and pushing to Artifact Registry
- Running terraform apply
- Running functional tests
- Once all code passes in dev, new changes may be automatically pushed to Production depending on your DevOps process
- The same checks and builds then take place in Production, and your new model is launched
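A cloudbuild.yaml implementing steps like those above might look as follows. This is a hedged sketch: the image names, test command, and Terraform directory are placeholders, not files from this tutorial:

```yaml
# Hypothetical cloudbuild.yaml; image names, paths, and tags are placeholders.
steps:
  # Run unit tests before anything is built or deployed.
  - name: "python:3.11-slim"
    entrypoint: "bash"
    args: ["-c", "pip install -r requirements.txt && python -m pytest tests/"]

  # Build the training image and push it to Artifact Registry.
  - name: "gcr.io/cloud-builders/docker"
    args: ["build", "-t", "us-central1-docker.pkg.dev/$PROJECT_ID/my-repo/trainer:$SHORT_SHA", "."]
  - name: "gcr.io/cloud-builders/docker"
    args: ["push", "us-central1-docker.pkg.dev/$PROJECT_ID/my-repo/trainer:$SHORT_SHA"]

  # Apply infrastructure changes, e.g. new buckets, via Terraform.
  - name: "hashicorp/terraform:1.7"
    args: ["apply", "-auto-approve"]
    dir: "terraform"
```

`$PROJECT_ID` and `$SHORT_SHA` are built-in Cloud Build substitutions, so each build is tagged with the commit that produced it.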
(*) In this tutorial we will not connect Cloud Build to a repo; instead we will run Cloud Build manually to mimic what the trigger would execute.
- Use a .gitignore in the same directory as cloudbuild.yaml to ignore temporary files, e.g. Terraform state
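A .gitignore for that directory might exclude Terraform's local state and caches; this is a sketch to adapt to your repo:

```
# Terraform local state and plugin cache
.terraform/
*.tfstate
*.tfstate.backup
# Python cache from local test runs
__pycache__/
```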
- https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.13.1/api/v1/custom_job.html#v1.custom_job.create_custom_training_job_from_component
- https://cloud.google.com/docs/terraform/resource-management/managing-infrastructure-as-code
- OLDER: https://cloud.google.com/architecture/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build