Advice for Deploying Your First Machine Learning Stack

Manav Kohli
Thanx
12 min read · Sep 4, 2019


Machine learning models offer the promise of encoding complex logic into black-box systems, replacing legacy solutions that demand higher development costs and deliver weaker predictive capabilities. Though the prediction logic may be obfuscated by the model, the research, analysis, and infrastructure investments can easily outweigh the time spent building convoluted, nested if-then statements. This post offers some considerations for building a predictive pipeline as a standalone stack.

When architecting our ML pipeline, our team encountered many issues we had never faced before. While companies like Airbnb [1], Uber [2], and Spotify [3], among others, offered excellent insights into best practices, those architectures required cumbersome infrastructure maintenance or offered solutions better suited to very large data throughputs. We needed a solution that the team could validate, required minimal infrastructure maintenance, and could scale to support many online and offline models. Faced with a dearth of resources that fit our needs, we ultimately decided to forge our own solutions to these problems. The considerations below offer guidance for establishing a data stack, evaluating cloud offerings, and determining which resources to invest in and when.

Establish Infrastructure Priorities

As with any product, building an MVP data pipeline can help teams avoid assuming unnecessary risk and set more transparent roadmaps. Identifying the key functional components of the system, along with the primary customer benefits and overall system impact, can help narrow the scope of the first milestone.

Before starting, consider whether your production database can support the load from a feature generation pipeline. Product-focused infrastructure stacks are not well equipped to manage the scale of computation required by ML systems. A model could require up to thousands of features per prediction as frequently as every minute. Querying a production database or a read replica at that cadence can impact critical user workflows. Isolating these requests to a separate database also serves the goal of establishing a data lake. Such data stores can hold a much larger scale of data and support faster aggregate query operations than traditional databases. Moving ML-related requests to a separate data store can also increase developer productivity, since teams can more quickly experiment with the schemas (or lack thereof) that fit best.

Creating a new data repository opens up many doors for possible stacks, all at different costs. Many suggest columnar databases that support distributed computation. Unfortunately, some of those systems require dedicated infrastructure teams to maintain. For teams entering the space, designing a system that can scale in isolation from other product infrastructure and be validated quickly at a low cost can help get the first model out the door faster.

Thanx’s primary concerns while developing our ML pipeline were mitigating negative impacts on our production systems, minimizing unnecessary costs, and moving quickly. These principles helped inform decisions when choosing and implementing the infrastructure and served as good evaluation metrics for the many solutions at each layer of the stack. Below is an image of our early architecture:

Identify MVP Functional Requirements

Very little of the work required in building production machine learning systems involves actual data science. This popular diagram captures it well:

Source: Hidden Technical Debt in Machine Learning Systems [4]

Tackling all these pieces can seem daunting, so identifying the key features of an MVP system can help guide decisions on where to invest time and resources. We identified the following basic functional requirements:

  • Generate features atomically
  • Write predictions offline
  • Complete predictions within SLA

Machine learning models rely on training data, and as such are "change anything, change everything" systems. This nature necessitates that any feature used in training must exactly match the feature used at validation, testing, and prediction time. One viable option would be the creation of an isolated stack solely for generating features (a feature store) with its own dedicated database. The idea originated with Uber's initial publication on Michelangelo [2], its end-to-end machine learning platform. A Hopsworks blog post [5] does a great job describing the benefits of that approach. The article highlights how models often make use of the same features, practitioners should not have to redo work or incur technical debt, and a central feature store can reduce the complexity of the entire system. Isolating feature calculation from model development is a best practice, but approaching it with its own stack for an MVP may be out of scope. Instead, investing in a framework that can logically separate feature generation and model serving may suffice. However, it's worth acknowledging that this is a monolithic approach.
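
One lightweight way to get that logical separation without a dedicated feature-store stack is to keep every feature definition behind a single registry that both the training pipeline and the serving path call. The sketch below illustrates the idea in Python; the registry, decorator, and feature names are hypothetical and not part of Thanx's actual implementation.

```python
# A minimal sketch of sharing feature definitions between training and
# serving so a feature can never drift between the two. All names here
# (FEATURE_FUNCTIONS, visits_last_30_days, etc.) are illustrative.
from datetime import datetime, timedelta
from typing import Callable, Dict

FEATURE_FUNCTIONS: Dict[str, Callable] = {}

def feature(name: str):
    """Register a feature function under a stable name."""
    def decorator(fn: Callable) -> Callable:
        FEATURE_FUNCTIONS[name] = fn
        return fn
    return decorator

@feature("visits_last_30_days")
def visits_last_30_days(user: dict) -> int:
    # Assumes each visit dict carries a datetime under "at".
    cutoff = datetime.utcnow() - timedelta(days=30)
    return sum(1 for v in user.get("visits", []) if v["at"] >= cutoff)

def build_feature_vector(user: dict) -> dict:
    # Both the training pipeline and the serving path call this,
    # so training and prediction always compute identical features.
    return {name: fn(user) for name, fn in FEATURE_FUNCTIONS.items()}
```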

Offline models offer a good first approach since they sidestep the need to expose a prediction API. Any system should ultimately support online models and those APIs, but as a first step to proving the architecture, that support may not be absolutely necessary. There are also plenty of great applications for offline models that can be critical to a business.

Although offline models may have more relaxed prediction SLAs, they still have them. Defining these early on can help inform what and how many features are used for the model and what type of hardware to invest in. Further, testing against these SLAs early on helps avoid database migrations later. For example, considering the speed at which features need to be queried could either lead to setting up a distributed cluster at a higher time investment or simply storing them in a hosted NoSQL offering.

Given the first two product requirements, we thought the most flexible approach would be to use an ETL pipeline framework. We investigated Luigi [6], Feast [7], and Airflow [8], among others. Airflow proved to have the best community support and was the easiest to set up. Crucially, it offered idempotent, concurrent task management out of the box with an easy-to-use UI. Writing test coverage for jobs also seemed straightforward. Authoring those tests initially slowed down development, but establishing the patterns early on and writing tests that covered task inclusion and execution helped us deploy new features faster. Maintaining standard development testing practices for the data infrastructure stack helped cement our confidence in the system, since many workflows interacted with multiple system components.
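
To make the shape of such a pipeline concrete, here is a minimal sketch of an Airflow DAG with an extract step, a feature-generation step, and an offline prediction step. It assumes the Airflow 1.10-era API that was current when this post was written; the DAG id, task ids, and callables are hypothetical.

```python
# A minimal sketch of an offline prediction DAG: extract raw data,
# generate features, then score them, with each task retried and
# keyed by execution date so reruns stay idempotent.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract_raw_data(**context):
    ...  # pull the day's records into the intermediate store

def generate_features(**context):
    ...  # compute and persist features keyed by (entity, execution_date)

def run_predictions(**context):
    ...  # load the model, score the stored features, write predictions

default_args = {"owner": "ml", "retries": 1, "start_date": datetime(2019, 9, 1)}

with DAG("offline_predictions", default_args=default_args,
         schedule_interval="@daily", catchup=False) as dag:
    extract = PythonOperator(task_id="extract_raw_data",
                             python_callable=extract_raw_data,
                             provide_context=True)
    features = PythonOperator(task_id="generate_features",
                              python_callable=generate_features,
                              provide_context=True)
    predict = PythonOperator(task_id="run_predictions",
                             python_callable=run_predictions,
                             provide_context=True)

    extract >> features >> predict
```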

We decided to skip online validation and retraining mechanisms for the first iteration of our pipeline. These components are crucial to ensuring that the model performs well in production, but did not seem critical for launch. However, Airflow is well equipped to serve both of these goals, and we have since implemented both.

Invest in Scalable Infrastructure

The functional requirements will help inform which solutions to invest in at each layer, but, beyond the obvious system components, the following should also be considered:

  • Distributed worker management system
  • Data lake

Distributed workers can scale feature generation pipelines but can come at a significant investment cost. Identifying a balance between speed and resources may help teams decide on the right infrastructure. Data lakes help teams scale by offering large repositories that support aggregate queries and can store many features; they are primarily differentiated by cost and by the cloud provider a team already uses.

Distributed Worker Management System

Generating data sets or features for prediction can require massive, slow computation. These queries can draw out data analysis, cause delays at prediction time, and instigate a slow feedback loop between model development and analysis. Unfortunately, teams may find managing this piece more cumbersome than any other component of the system. As a very simple solution, Celery with some message broker (RabbitMQ, Redis) can distribute tasks across workers. Alternatively, event-triggered lambda functions could effectively complete the same asynchronous processing. However, for datasets in the terabyte range, Hadoop may offer a better solution. This post [9] does a great job outlining the various ETL stacks for companies with varying sizes of data. In short, focus on a SQL friendly data store initially and move to large-scale distributed task management systems when the size of the data necessitates it.
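
As a concrete example of the simplest tier, below is a hedged sketch of fanning feature generation out across Celery workers with Redis as the broker. The broker URL, task body, and helper functions are assumptions for illustration, not a description of any particular production setup.

```python
# A minimal sketch of distributing per-user feature computation with
# Celery. Each worker handles one user at a time; failed tasks can be
# retried by the broker because acks_late is enabled.
from celery import Celery

app = Celery("features", broker="redis://localhost:6379/0")

def load_raw_events(user_id):
    ...  # hypothetical: read the user's raw events from the data lake

def build_feature_vector(events):
    ...  # hypothetical: compute features from the raw events

def save_features(user_id, features):
    ...  # hypothetical: persist features to the feature store

@app.task(acks_late=True)
def compute_features_for_user(user_id):
    events = load_raw_events(user_id)
    features = build_feature_vector(events)
    save_features(user_id, features)

# Fan the work out across workers, one task per user:
# for user_id in user_ids:
#     compute_features_for_user.delay(user_id)
```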

If a solution like Hadoop requires an excessive time investment, choosing a data lake that supports aggregate calculations can help speed up the system. In particular, Elasticsearch can offer more performant aggregations than MongoDB.

In addition to supporting a system for distributed feature calculation, building some asynchronous mechanism for transferring data between a production database and a data lake can also help the system scale further down the road. However, this does not necessarily have to be the same system used for feature calculation since it will operate at a much smaller scale. At Thanx, we use a pub/sub system triggered by object lifecycle events to transmit data between our production database and data lake.
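
As a rough illustration of that pattern, the sketch below uses Redis pub/sub as a stand-in broker: a lifecycle hook on the production side publishes the changed record, and a separate consumer writes it into the lake. The channel name, payload shape, and lake writer are assumptions, not Thanx's actual system.

```python
# A minimal sketch of asynchronously shipping record changes from the
# production database to the data lake via a pub/sub channel.
import json

import redis

broker = redis.Redis()

def on_record_saved(record: dict) -> None:
    # Called from a model lifecycle hook (e.g. after_save) in the
    # production app; the slow write to the lake happens elsewhere.
    broker.publish("lake.updates", json.dumps(record))

def run_lake_consumer() -> None:
    # Runs as a separate process alongside the data stack.
    pubsub = broker.pubsub()
    pubsub.subscribe("lake.updates")
    for message in pubsub.listen():
        if message["type"] == "message":
            record = json.loads(message["data"])
            # write_to_data_lake(record)  # hypothetical lake writer
```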

Data Lake

Data lakes serve as central repositories for all the data required for and created by ML systems. The most important criterion for assessing a data lake solution is whether it enables data analysis. If the existing system already facilitates data exploration, then this is moot. However, investing in a data lake that supports a friendly query syntax greatly benefits both product and engineering teams.

Besides supporting easy search, a data lake strategy should also consider how data for processing and the final features are stored. Most feature generation tasks require extracting data, cleaning it, and performing some calculations. Ensuring that data gets passed through each part of the system correctly can be challenging at scale. Designing around an intermediary data store for processing will help build tasks in isolation from each other and generally reduce the complexity of the system.

While intermediate data can be removed when the tasks have completed, features should be persisted. Unfortunately, the size of this data grows quickly. Below are quick summaries of solutions worth considering and how to evaluate them.

AWS Redshift: Good for large data and data analysis, SQL friendly, but can become expensive.

Amazon Redshift offers a great solution for teams looking to support lots of data that can be easily queried, but it can be pricey. According to the AWS calculator [10], at the time of writing it costs a minimum of $183 per month. However, as it is within the AWS ecosystem, it can offer a higher level of security if configured behind a VPC. Many practitioners I've spoken with recommend it anyway, particularly for its ease of data analysis.

AWS Elasticsearch: Scales well, difficult for data analysis, great aggregate query support, can become expensive.

AWS also offers a hosted Elasticsearch service. These prices can also grow quickly, but the clusters scale very well horizontally and support a large set of aggregate computation functionality. Elasticsearch documents also use a configurable mapping rather than a schema, providing an extra layer of flexibility.
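
To make the aggregate-query point concrete, here is a hedged example of the kind of query Elasticsearch handles well: average purchase amount per user over the last 90 days. It assumes a pre-8.x Python client and hypothetical index and field names.

```python
# A small sketch of an Elasticsearch aggregation that could back a
# feature like "average spend per user over the last 90 days".
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a cluster reachable on localhost:9200

response = es.search(
    index="purchases",  # hypothetical index
    body={
        "size": 0,
        "query": {"range": {"created_at": {"gte": "now-90d"}}},
        "aggs": {
            "per_user": {
                # user_id is assumed to be a keyword-mapped field
                "terms": {"field": "user_id", "size": 10000},
                "aggs": {"avg_amount": {"avg": {"field": "amount"}}},
            }
        },
    },
)

buckets = response["aggregations"]["per_user"]["buckets"]
features = {b["key"]: b["avg_amount"]["value"] for b in buckets}
```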

AWS DocumentDB: MongoDB but more expensive than MongoDB Atlas.

AWS DocumentDB is a more expensive competitor to a hosted MongoDB service, as discussed in this post [11]. Again, this could be a worthwhile investment over MongoDB Atlas if the rest of the system is in AWS and security is a priority.

AWS DynamoDB: Schemaless but can become very expensive.

AWS DynamoDB also supports schemaless documents, but charges by read/write throughput and, as a result, can grow in cost very quickly. The Segment engineering team wrote a very thorough analysis of why such costs skyrocket [12].

AWS S3: Cheap and fast, good for one-off data cleaning tasks.

AWS S3 also offers a good solution for processing large data since it offers very cheap storage. In particular, if working with an existing backup or some saved repository, lambda functions in conjunction with S3 buckets can serve as the basis for a serverless ETL stack. AWS published a guide here [13].
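
Below is a minimal sketch of that serverless pattern: a Lambda handler triggered by objects landing in a raw bucket, which cleans the rows and writes them to a processed bucket. The bucket names, file format, and cleaning rule are hypothetical.

```python
# A minimal sketch of an S3-triggered Lambda used as a serverless ETL
# step: read the uploaded CSV, drop rows missing a user_id, and write
# the cleaned file to a processed bucket.
import csv
import io

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [r for r in csv.DictReader(io.StringIO(body)) if r.get("user_id")]
        if not rows:
            continue

        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

        # "processed-features" is a hypothetical destination bucket.
        s3.put_object(Bucket="processed-features", Key=key, Body=out.getvalue())
```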

GCP Bigtable: Good for new pub/sub stack and cross-team data analysis and visualization.

GCP Bigtable offers a great solution for teams looking for easy business intelligence alongside their data store. The Google cloud infrastructure UI is also much easier to navigate than AWS and may offer lower startup costs. Finally, GCP has great support for event streams; this makes it a compelling solution for setting up pub/sub to push data from a production database to an intermediary data store, run batch jobs, and finally store it for analysis.

MongoDB Atlas: Very easy to set up hosted MongoDB instances

MongoDB Atlas offers a hosted version of MongoDB that runs on AWS, GCP, or Azure. It also has a free tier that doesn’t require any payment information. There are many other great solutions out there, but in reality the existing cloud provider usually predetermines the rest of these components.

There are different strata of resources that can be used for each layer of the stack based on team size, cost, and maintenance effort. At Thanx, we avoided setting up a Hadoop cluster because of the maintenance burden. We instead opted for Elasticsearch as a raw data store and MongoDB Atlas for feature storage. We chose the former for its flexibility and because it was already part of our production workflows. It also scales very well, does not require a schema, and can compute complex aggregate queries very quickly. Most importantly, it is fully managed by AWS. MongoDB Atlas required the least amount of configuration to set up and allowed our team to move quickly while iterating on other parts of the system.

Build a Trustworthy System

Slow or inaccurate models can do more harm than good to users' perception of an organization. Due to their complexity, feature pipelines need to be heavily tested and maintained with the same level of care as product features. Supporting tests and monitoring can help reduce the number of bugs released, and leaning on asynchronous processing can make the system much faster.

Support Testing and Monitoring

As data moves through the system, validating that the correct transformations happen in the correct order can become difficult. Data science teams that bootstrap a first model may tend to pull data onto local machines, generate training features there, and either train the model on the same computer or in the cloud. Finally, once the model is trained, they may build an isolated system for generating the features that feed into the model in production. This approach risks those features varying between training and prediction time, compromising the model's integrity. As with any software system, supporting isolated local, staging, and production environments helps facilitate testing and best engineering practices.

Besides supporting multiple environments, covering all computational tasks with tests helps establish idempotency of feature generation. Investing in these early on will dramatically reduce technical debt. Integration tests can especially help at the pipeline level to ensure that the correct tasks are executed in the correct order.
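
As one example of the pipeline-level tests mentioned above, the sketch below checks that a DAG contains the expected tasks and wires them in the expected order. It assumes the hypothetical offline_predictions DAG sketched earlier and Airflow's DagBag loader.

```python
# A small pipeline-level test: the DAG loads, contains exactly the
# expected tasks, and executes them in the expected order.
from airflow.models import DagBag

def test_dag_tasks_and_order():
    dag = DagBag().get_dag("offline_predictions")
    assert dag is not None

    task_ids = {t.task_id for t in dag.tasks}
    assert task_ids == {"extract_raw_data", "generate_features", "run_predictions"}

    features = dag.get_task("generate_features")
    assert "extract_raw_data" in features.upstream_task_ids
    assert "run_predictions" in features.downstream_task_ids
```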

Although not covered here, monitoring the model in real time can also help avoid bugs related to incorrect predictions. At the least, negative predictions should be written to the data lake along with the features used to help debug the model. They can conveniently double as examples to retrain on as well. A more robust approach would include retraining the model at appropriate intervals to ensure that it encodes changes in the patterns of the underlying data. Again, ETL pipelines such as Airflow help achieve this.
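
A small sketch of that logging step, assuming MongoDB as the prediction store: each prediction is persisted with the exact features that produced it, so it can be debugged later and relabeled as a retraining example. The collection name and document shape are assumptions.

```python
# A minimal sketch of persisting predictions alongside their inputs for
# debugging and later retraining.
from datetime import datetime

from pymongo import MongoClient

predictions = MongoClient().ml.predictions  # hypothetical db/collection

def record_prediction(entity_id, features, score):
    predictions.insert_one({
        "entity_id": entity_id,
        "features": features,          # exact inputs at prediction time
        "score": score,
        "predicted_at": datetime.utcnow(),
        "label": None,                  # filled in later for retraining
    })
```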

Make it Fast

The ETL paradigm fits well with feature generation tasks since it encourages offline processing. Feature calculation can be the most time-consuming component of a prediction API, and preprocessing that data can help converge on acceptable request response times. Most users expect sub-second response times, so ensuring that results are quick and accurate is paramount to a model's usefulness. Identifying slow feature calculations early on and maximizing offline calculation can improve performance to an acceptable level. Dimensionality reduction can also go a long way, as sometimes a few features are the most powerful indicators.
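
As a rough illustration of that last point, here is a small sketch that uses scikit-learn's SelectKBest on synthetic data to keep only the top-scoring features, so the online path has less to compute. The shapes and the choice of selector are illustrative assumptions.

```python
# A small sketch of trimming the feature set before serving: keep only
# the features that carry most of the predictive signal.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.rand(1000, 200)        # 200 candidate features (synthetic)
y = np.random.randint(0, 2, 1000)    # binary labels (synthetic)

selector = SelectKBest(score_func=f_classif, k=20).fit(X, y)
keep = selector.get_support(indices=True)
print("features worth computing online:", keep)
```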

Thanx’s Data Strategy

Applying software engineering best practices and building an MVP can help mitigate many issues that arise while building a machine learning stack. The simplest solution is often the best one, and in the context of building a prediction pipeline, this entails choosing the infrastructure that requires the least amount of effort and the lowest setup and maintenance costs while delivering the minimum functionality. For Thanx, this meant the following major stages:

  • (MVP) Deploying Airflow on Heroku with Docker, free-tier databases, and hosted MongoDB Atlas as a feature store
  • Supporting concurrent task management
  • Deploying automated model retraining tasks
  • Migrating Heroku databases to AWS
  • Adopting Redshift as a data lake

Each of these steps required a thoughtful and time-consuming approach. I hope that this article helps your team make better-informed decisions more quickly while still following best practices. Feel free to reach out to me (manav @ thanx.com) if you have more specific questions about the stack or offerings to invest in. Alternatively, if you are looking for a new job and are excited by our mission, we're hiring (!!!) so feel free to reach out.

[1]: https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
[2]: https://eng.uber.com/michelangelo/
[3]: https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
[4]: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
[5]: https://www.logicalclocks.com/feature-store/
[6]: https://github.com/spotify/luigi
[7]: https://github.com/gojek/feast
[8]: https://airflow.apache.org/
[9]: https://medium.com/@natekupp/getting-started-the-3-stages-of-data-infrastructure-556dac82e825
[10]: https://calculator.s3.amazonaws.com/index.html
[11]: https://medium.com/@michaelrbock/nosql-showdown-mongodb-atlas-vs-aws-documentdb-5dfb00317ca2
[12]: https://segment.com/blog/the-million-dollar-eng-problem/
[13]: https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/amazon-s3-data-lake-storage-platform.html


Amateur enthusiast & approximate knowledge expert. Engineering @Thanx.