When I joined Paladin AI, we were in the middle of a major overhaul of the machine learning models that underlie most of InstructIQ's analytics and smart review features. The models were in active development: not a day passed without new data coming in, data scientists examining flight sessions, or improvements to data cleaning, training, or inference. However, getting these hot-off-the-press upgrades out to all the training centers where we were installed, so we could demo the better features and improved classifications, still took months.
We used Databricks for all data science tasks, including improving the models, because of its superior usability for the data science team. While this was a lot more productive than writing individually deployed Python workflows for iterative development, I had a bit of a wake-up call when the time came to investigate how to serve our actual customers from it.
First, Databricks is installed into one AWS account and expects to scale up all compute there (and that you will set up access from those instances to any relevant data). As is a common pattern for basic access control, we have separate AWS accounts for development and production, and the data in the production account is further segregated to provide isolation between individual customers. The Databricks cost portal is also not spectacularly enlightening, so I wanted to use separate Databricks accounts to at least break down cost reporting between the daily (and intensive) usage by the data science team and the actual operational expenses of running inference for client accounts.
The production workspace is a Databricks workspace that runs in our AWS production account. We do not use individual workspaces or notebooks there; all code and data are (selectively) imported from the dev workspace through the processes detailed below. Apps in the production account have a Databricks API token to start jobs in this workspace exclusively, meaning all customer workloads run there.
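As a rough sketch of what that looks like in practice (not our exact code; the environment variable names, job ID, and parameters are placeholders), an app triggers a production job with a single call to the Databricks Jobs API using that token:

```python
import os
import requests

# Placeholders: workspace URL, token variable names, and parameters are illustrative.
DATABRICKS_HOST = os.environ["DATABRICKS_PROD_HOST"]    # e.g. https://xxxx.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_PROD_TOKEN"]  # token valid only for the production workspace

def start_inference_job(job_id: int, session_id: str) -> int:
    """Trigger a pre-defined Databricks job in the production workspace."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={"job_id": job_id, "notebook_params": {"session_id": session_id}},
        timeout=30,
    )
    resp.raise_for_status()
    # The run_id can later be polled through /api/2.1/jobs/runs/get
    return resp.json()["run_id"]
```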
Databricks has a relatively new feature called Unity Catalog, which is essentially a datastore that can be shared between accounts without requiring extra infrastructure to host an external metastore. Using Unity Catalog, our "gold" (cleaned) training data could be shared between Databricks workspaces so that models in the production workspace would use the same data as during the original training. (It also means data scientists cannot accidentally retrain a production model on less clean data or data of uncertain lineage.)
We had three challenges with Unity Catalog:
To access Unity Catalog, you have to enable it in both workspaces (that's what we did) and then add the Unity Catalog metastore in the data tab.
Once that's done, Unity Catalog is accessible from SQL or pyspark (we use SQL to send data to Unity Catalog after ingestion, and pyspark in our regular training).
The Unity Catalog tables are named `<metastore name>.<schema name>.<table name>`. The metastore name is set when you add it (it defaults to `main`); the schema and table you have to create manually. There is no default schema like you may find in the default metastore.
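To make the three-level naming concrete, here is a minimal sketch of both halves, assuming a metastore named `main`, a `gold` schema, and a hypothetical `flight_sessions` table (all names are illustrative, not our real ones):

```python
# In a Databricks notebook, `spark` is already defined.

# After ingestion: create the schema and table explicitly
# (there is no default schema in Unity Catalog).
spark.sql("CREATE SCHEMA IF NOT EXISTS main.gold")
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.gold.flight_sessions (
        session_id STRING,
        features   ARRAY<DOUBLE>,
        label      STRING
    )
""")

# In training code: read the same table by its three-level name.
training_df = spark.table("main.gold.flight_sessions")
```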
Obviously, we also want the model code and parameters to be thoroughly tested, then reviewed, then pushed to production, where they can't be changed in place.
To enforce that, we store all model code (for training and inference) in GitHub. A `main` branch is used for the models in the development environments, once they have been validated on the test datasets, reviewed by data scientists, and are ready for integration with the app; a `release` branch is used for the models in use by customer workloads. Once code is pushed to `release`, all models in that repository are re-trained and re-registered in the Databricks production environment, where they become available for use.
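A minimal sketch of what that re-train / re-register step can look like, assuming MLflow's model registry, a scikit-learn flavored model, and a hypothetical `train_model()` helper standing in for the project-specific training code:

```python
import mlflow

def retrain_and_register(model_name: str):
    """Re-train a model and register the new version in the production workspace."""
    with mlflow.start_run() as run:
        model = train_model()  # hypothetical: project-specific training code
        mlflow.sklearn.log_model(model, artifact_path="model")
        model_uri = f"runs:/{run.info.run_id}/model"
        # Registering creates a new version that production jobs can load.
        return mlflow.register_model(model_uri, model_name)
```

A workflow on the `release` branch can then kick this off as a Databricks job, for example with the same Jobs API call shown earlier.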
This was a bit counterintuitive to data scientists, who weren't sure what to do when working on notebooks whose code would be pushed to GitHub. It requires using the "Save" feature as well as making sure your Git repository is set up in Databricks. Since all data scientists work in Databricks, they were also afraid of occasionally overwriting each other's changes; we prevented that by having everyone set up their own copy of the repo (under Repos > user name) pointing to their own branch or feature branch. Pull requests to `main` can then be used to review each other's code. This also allows test notebooks (scratch data cleaning or validation notebooks) to exist only on an individual data scientist's branch if they're only used during development, while still being searchable and archived.
Although the process is very automatable (through GitHub Actions or other source-dependent workflows plus the Databricks API) and works well to preserve the team's ability to make changes and track which version of each model is in use in each environment, it does not do everything. For instance, we can't do canary deploys (sending a percentage of production API calls to one model version) with it. We don't have enough individual customer accounts for redirecting all of a customer's traffic to a different model version to be worth it, but if we wanted to do that we'd need some way to tag registered models and more complex scripts to call the models (selecting a different model depending on the request); a rough sketch of what that could look like is below. Although Databricks offers some nice features for the development stage, this is one place where it doesn't have great support for MLOps-type workflows.
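For illustration only, here is roughly what that request-level selection could look like if we tagged registered model versions ourselves; the tag name, roles, and the 10% split are all made up:

```python
import hashlib
from mlflow.tracking import MlflowClient

client = MlflowClient()

def pick_model_version(model_name: str, request_id: str, canary_fraction: float = 0.1) -> str:
    """Route a deterministic fraction of requests to a 'canary' model version."""
    stable, canary = None, None
    for mv in client.search_model_versions(f"name = '{model_name}'"):
        role = mv.tags.get("deployment_role")  # hypothetical tag we would maintain ourselves
        if role == "canary":
            canary = mv.version
        elif role == "stable":
            stable = mv.version
    # Hash the request ID into a 0-99 bucket so the same request always routes the same way.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    if canary is not None and bucket < canary_fraction * 100:
        return canary
    return stable

# The chosen version can then be loaded as f"models:/{model_name}/{version}".
```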
That said, being able to release models safely and, above all, fast is what matters most for a startup, where we essentially just want fixes and features live as soon as possible. It also offers some very nice benefits: