Case Study - Specialized ML and Data Science Workload
Peachjar is a medium-size educational technology company founded in 2011 that produces eLearning, communication, and other software solutions targeted at increasing parent and student engagement in the learning environment. Peachjar unites schools, parents and communities in a joint mission to elevate student outcomes. Its cloud-based communication platform improves and streamlines school-to-home and community-to-school-to-home communication by distributing critical information and resources directly to parents as interactive digital flyers.
Peachjar was searching for a way to run specialized machine learning and data science workloads on an elastically scalable, cost-efficient platform. Their existing system of virtual machines required excessive maintenance and was difficult to both scale and replicate. Their TensorFlow and Spark workloads were orchestrated by a custom-built solution that failed to meet Peachjar’s needs and took capacity away from their development teams.
Peachjar’s VM solution was manually deployed and extremely difficult to replicate across environments. This was detrimental to their testing efforts and increased the risk of delivering changes and new workloads to their production environment.
Peachjar’s existing solution was also cost-inefficient. The workloads required large, specialized instance sizes, and these instances sat idle for long periods of time. They were often left running after jobs completed, accumulating large cloud bills for resources that were severely underutilized.
Peachjar’s engineering and technology teams are comfortable with containerized environments and wanted to move their data science and machine learning workloads to a containerized solution. They did not, however, have the staffing in place to maintain a highly specialized Kubernetes cluster, and did not see value in maintaining the underlying infrastructure of the container orchestration platform.
Amazon Elastic Kubernetes Service (EKS) was a perfect fit for Peachjar’s needs. Other AWS services such as CloudFormation, CodeCommit, CodeBuild, and CodePipeline also provided a highly repeatable delivery process that allows the customer to deploy the solution across environments reliably, easily, and quickly.
Stratus10 was the perfect fit to deliver the containerized solution for Peachjar’s machine learning and data science workloads. Their expertise across AWS’ suite of services ensured that Peachjar’s solution was delivered using the best combination of tools available. Their experience in delivering complex containerized solutions across the AWS ecosystem resulted in a smooth delivery of the EKS solution that met all of the customer’s scaling, cost, and performance targets.
Stratus10 assisted Peachjar in migrating their machine learning and data science workloads from a static, virtual-machine-based solution to an elastically scalable containerized solution orchestrated by EKS. Argo replaced a custom task orchestration solution that was difficult to deploy and maintain, and multiple node groups were used to optimize performance for specialized data science jobs.
Stratus10 delivered a solution consisting of EKS, EFS, and Argo, deployed using CloudFormation. EKS was chosen as the container orchestration platform both because the customer was familiar with Kubernetes and because the managed control plane requires very little administration and maintenance. Stratus10 used multiple managed node groups so that different instance sizes could be used for the specialized jobs. A robust tagging and labeling scheme was also created so that the Kubernetes nodeSelector field could be used to target pods to specialized nodes. The Horizontal Pod Autoscaler and the Cluster Autoscaler were used to elastically scale both pods and nodes within the cluster.
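As a rough illustration of the labeling and autoscaling ideas above, the following Python sketch builds a pod manifest that uses nodeSelector and reproduces the core replica calculation documented for the Horizontal Pod Autoscaler. The label key and value, the pod name, and the image name are hypothetical; the case study does not state the labels or images Peachjar actually uses.

```python
import json
import math

# Hypothetical label; the case study does not name the real node labels.
SPECIALIZED_NODE_LABELS = {"workload-type": "tensorflow"}


def pod_manifest(name, image, node_selector):
    """Build a minimal Pod manifest that can only be scheduled on nodes
    carrying every label listed in node_selector."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": dict(node_selector),
            "containers": [{"name": name, "image": image}],
            "restartPolicy": "Never",
        },
    }


def node_matches(node_selector, node_labels):
    """nodeSelector semantics: every key/value pair in the selector
    must be present in the node's labels."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Core HPA formula from the Kubernetes documentation:
    ceil(currentReplicas * currentMetric / targetMetric).
    Tolerance and min/max replica bounds are omitted for brevity."""
    return math.ceil(current_replicas * current_metric / target_metric)


if __name__ == "__main__":
    # "tf-train" and "example/tf-job:latest" are placeholder names.
    pod = pod_manifest("tf-train", "example/tf-job:latest", SPECIALIZED_NODE_LABELS)
    print(json.dumps(pod, indent=2))
```

The printed JSON could be passed to kubectl, which accepts JSON as well as YAML manifests. Keeping the selector in one constant mirrors the idea of a consistent labeling scheme shared by node groups and workloads.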
Amazon Elastic File System (EFS) was mounted to all nodes, providing every workload with access to the data and eliminating the need for tasks that transferred data between systems. The EFS mount was delivered as a shared secret, making it very easy to mount into pods.
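The case study does not detail how the node-level EFS mount is surfaced inside pods. As one possible approach, and purely as an assumption, the sketch below exposes a directory that is already mounted on the node to every container in a pod through a hostPath volume. The path /mnt/efs, the volume name, and the pod contents are hypothetical.

```python
import copy
import json


def add_host_path_volume(pod, volume_name, host_path, mount_path):
    """Return a copy of a Pod manifest in which a directory on the node
    (for example, an existing EFS mount point) is mounted into every container."""
    patched = copy.deepcopy(pod)
    spec = patched["spec"]
    # Declare the volume once at the pod level.
    spec.setdefault("volumes", []).append(
        {"name": volume_name, "hostPath": {"path": host_path, "type": "Directory"}}
    )
    # Mount the volume into each container at the same path.
    for container in spec["containers"]:
        container.setdefault("volumeMounts", []).append(
            {"name": volume_name, "mountPath": mount_path}
        )
    return patched


if __name__ == "__main__":
    # Placeholder pod; names and image are illustrative only.
    base_pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "spark-job"},
        "spec": {"containers": [{"name": "spark-job", "image": "example/spark:latest"}]},
    }
    print(json.dumps(add_host_path_volume(base_pod, "shared-data", "/mnt/efs", "/data"), indent=2))
```

Because the function copies the manifest before changing it, the same base pod definition can be reused for jobs that do not need the shared data.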
Argo, Spark, and TensorFlow made up the primary components of the customer’s machine learning solutions. Stratus10 assisted in the deployment and configuration of containerized versions of each of these components.
Stratus10 delivered the entire solution through CloudFormation so Peachjar could reliably deliver the machine learning and data science cluster to multiple environments. They can also easily tear down the cluster during idle times and quickly bring the entire solution back up when it’s needed.
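The bring-up and tear-down cycle described above could be scripted along the following lines. This is a minimal sketch, assuming the boto3 CloudFormation client and a single stack; the stack name, template URL, and parameter names are hypothetical and do not come from the Peachjar deployment.

```python
def bring_up(cfn, stack_name, template_url, parameters):
    """Create the stack and block until CloudFormation reports completion.
    cfn is expected to behave like boto3.client("cloudformation")."""
    cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        # Assumption: the template creates named IAM resources, as EKS stacks often do.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)


def tear_down(cfn, stack_name):
    """Delete the stack during idle periods and wait for deletion to finish."""
    cfn.delete_stack(StackName=stack_name)
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)


def main():
    # Imported lazily so the helpers above can be exercised without boto3 installed.
    import boto3

    cfn = boto3.client("cloudformation")
    # Hypothetical stack name, template location, and parameter.
    bring_up(
        cfn,
        "ml-cluster-dev",
        "https://example-bucket.s3.amazonaws.com/ml-cluster.yaml",
        {"Environment": "dev"},
    )
    # ... run jobs, then release the resources when the cluster is idle ...
    tear_down(cfn, "ml-cluster-dev")
```

Passing the client in as an argument keeps the helpers easy to test with a stand-in object and lets the same code target different accounts or regions. Calling main() requires AWS credentials and creates real, billable resources.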
Results and Benefits:
The machine learning and data science EKS solution delivered by Stratus10 has met all performance and cost optimization goals set by the customer. Their workloads are now delivered through containerized solutions the customer is familiar and comfortable with, and the time and resources required to maintain their machine learning systems have dropped significantly. By targeting autoscaling node groups for specialized workloads, Peachjar can now run their jobs on larger, faster infrastructure at a lower cost than the previous static solution. Finally, the solution is highly repeatable across their environments, and Peachjar can more confidently release changes to their ML solution.
Use case: App Modernization
Date: March 2020
Category: AI / ML / Containers / DevOps / Cost Optimization