Jungle Scout case study: Kedro, Airflow, and MLFlow use on production code
Author: Eduardo Ohe, Principal Machine Learning Engineer, Jungle Scout
Special thanks to Lais Carvalho (developer advocate at QuantumBlack) for her collaboration in this article.
This case study describes how the data science team at Jungle Scout, the leading all-in-one platform for finding, launching, and selling products on Amazon, uses Kedro to deploy machine learning models into production.
About Jungle Scout and Kedro use
Jungle Scout helps sellers on Amazon start and scale their businesses. Currently, we have more than 150 employees, 50 of whom are engineers.
Everyone on the data science team uses Kedro, and as the company expands, new hires are quickly onboarded with it as well. Since our models are built on data, we need a framework that enables collaboration and enforces modularity.
I’m Eduardo Ohe, Principal Machine Learning Engineer, and I worked with Gabe Shaughnessy, Principal Data Scientist, and the team at Kedro to explain how Kedro is integrated into Jungle Scout’s production workflow and how it has helped us write effective data pipelines that are production-ready from the get-go.
Our team found out about Kedro from a colleague, Alex Handley, Platform Group Architect. Since then, Kedro has streamlined our workflow and avoided a lot of back-and-forth debugging. It has allowed our company to deliver more value to our customers quickly, which is the team's number-one goal. Kedro has been tremendously valuable for us as a team and as a company.
“Before Kedro, we had many notebooks in different versions in different files and directories. Everything was scattered,” says Shaughnessy. “Implementing Kedro into our workflow has made life a lot simpler and cleaner in terms of pipeline organization. In addition to organization, being able to drill down into specific nodes to understand where a particular problem lives is the main gain. Now we don’t have to run and compare notebooks individually.”
At Jungle Scout, we frequently train new Sales Estimator models for the US Amazon marketplace. To ensure quality, we perform manual checks on each model before it is released to our customers. The entire process, from training to review to model promotion, takes roughly two hours. The plan was to support nine additional marketplaces, which would have added 18 hours of review to every release in our workflow.
We decided to revamp our training, review, and promotion pipeline by modernizing our infrastructure.
- Model training: We restructured our models with Kedro, which allowed us to build data pipelines that were testable and could run in parallel.
- Scheduling: We used Airflow to help schedule, parallelize, and monitor model training jobs.
- Model Review: We built a custom model review and promotion app based on Dash. This allowed us to quickly review several models across multiple marketplaces with confidence.
We can now train, review, and promote models for all marketplaces in about one hour per release. We are roughly 18 times faster than our old process while supporting more marketplaces!
Our workflow with Kedro
Here's an example of how we build our most common machine learning model, the Sales Estimates (SE) model, using Kedro:
To build our model, we combine several data points and product categories. The model aims to predict how many sales a particular product will accrue. This helps our users choose products with low competition, and hence a greater chance of selling successfully on Amazon.
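To make the prediction task concrete, here is a toy regression sketch in Scikit-Learn. The feature set (price, review count, rating, category rank), the synthetic data, and the choice of a random forest are all assumptions for the sketch, not Jungle Scout's actual model.

```python
# Toy sales-estimate regressor on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Stand-in product features: price, review count, star rating, category rank.
X = rng.random((200, 4))
# Stand-in target: monthly unit sales, loosely driven by two of the features.
y = 100 * X[:, 1] + 50 * X[:, 3] + rng.normal(0, 5, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
predicted_sales = model.predict(X[:3])  # estimated sales for three products
```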
Once submitted, each Pull Request (PR) made to the Sales Estimates Kedro project in a GitHub repository is checked by CircleCI, which runs the Kedro tests. Once all tests have passed, a new Docker image is pushed to Amazon Elastic Container Registry (AWS ECR).
Deployment happens inside Amazon Elastic Container Service (AWS ECS) in an Airflow environment composed of the webserver and scheduler components. To divide tasks between workers, we use Celery, a distributed task queue, which allows the workers to scale. The workers also communicate with the storage layers in our “data sources container”, which is composed of Amazon Elastic File System (AWS EFS) to share the assets required by the workers, and pulls data from AWS Redshift to produce the models.
This ensures that every time the Airflow Docker operator runs, the image stored in AWS ECR is checked. If there is a new image, or a new version of the image, in ECR, that image is pulled and the entire Kedro pipeline runs inside the Docker container.
The image, built from the GitHub repository, contains the pipeline that builds our model. In this project, for example, our custom-built plain Scikit-Learn model runs inside the Airflow workers, fetches all the data necessary for training from the data sources, and outputs a model that predicts sales based on a product's features. The model is serialized and stored in an RDS PostgreSQL database, and deployment is administered through a Python Flask API in an AWS Elastic Beanstalk (AWS EB) environment.
For tracking and experimentation, we decided to pair Kedro with MLFlow. This allows us to import MLFlow, track all the parameters, and log the artifacts. In a nutshell, we have a server that integrates MLFlow with the environment: metadata is persisted in RDS Postgres and artifacts in AWS S3, which also makes it possible to use the MLFlow user interface.
We decided to use Postgres to store our models because the Sales Estimates models are trained frequently. If we need to query the sales prediction for a particular date, e.g. the first week of the month, it is easy to retrieve the model trained closest to that date.
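The idea of storing timestamped serialized models and filtering by date can be sketched as follows, using SQLite and pickle as stand-ins for RDS Postgres and our real serialization; the table schema and helper functions are hypothetical.

```python
# Sketch: store serialized models with a training timestamp, then load the
# model trained closest to a target date.
import pickle
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (trained_at TEXT, marketplace TEXT, payload BLOB)")


def save_model(model, marketplace, trained_at):
    conn.execute(
        "INSERT INTO models VALUES (?, ?, ?)",
        (trained_at.isoformat(), marketplace, pickle.dumps(model)),
    )


def load_model_closest_to(marketplace, target_date):
    row = conn.execute(
        """SELECT payload FROM models WHERE marketplace = ?
           ORDER BY ABS(julianday(trained_at) - julianday(?)) LIMIT 1""",
        (marketplace, target_date.isoformat()),
    ).fetchone()
    return pickle.loads(row[0])


save_model({"version": 1}, "US", datetime(2021, 3, 1))
save_model({"version": 2}, "US", datetime(2021, 3, 8))
model = load_model_closest_to("US", datetime(2021, 3, 7))  # nearest: version 2
```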
The benefits of Kedro
One of the main advantages of Kedro is that it applies software and data engineering concepts to data science work very smoothly. Now it is possible to see everything working together: the data engineering and data science portions live in the same repository. We can see exactly what the data sources are and inspect the catalogs. Kedro makes it much easier to manage all of our projects and to wrap up end-to-end solutions. Our next steps are to continue applying MLOps best practices to achieve continuous, near-real-time model training, deployment, and monitoring.
Shaughnessy thinks the main benefits of Kedro are better workflow collaboration and more efficient hiring practices.
“As we onboard new hires, they’re onboarded onto Kedro as well. It’s been faster than we were expecting for our hires to learn Kedro and add value with the tool.”
Adapting the use of Jupyter
We still use Jupyter Notebooks frequently for Exploratory Data Analysis (EDA), but only to hash out ideas; for anything beyond that, we strictly use Kedro pipelines. For data visualization, Jupyter is easier, and there is still a lot of value in the tool, but for training and more complex tasks, Kedro has the upper hand.
While learning how to use Kedro, the team also became fans of the YouTube channel DataEngineerOne, where Tam Nguyen, a McKinsey data engineer and solutions architect, explores Kedro's powerful features through experimentation.