Ray Integration for AWS Trainium and AWS Inferentia is Now Available

September 20, 2023 By Mark Otto

AWS Trainium and AWS Inferentia are now integrated with Ray on Amazon Elastic Compute Cloud (Amazon EC2). Ray is an open source unified compute framework that makes it easy to build and scale machine learning applications. Ray will now automatically detect the availability of AWS Trainium and Inferentia accelerators to better support high-performance, low-cost scaling of machine learning and generative artificial intelligence (AI) workloads. This means that users can now further accelerate model training and serving on AWS by sharding generative AI models and Large Language Models (LLMs) across Trainium/Inferentia accelerators using tensor parallelism.
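The core arithmetic behind tensor parallelism can be sketched in plain Python, with no Ray or Neuron hardware required. The matrices, `tp_degree` value, and helper names below are toy illustrations for this post, not part of the Ray or Neuron APIs:

```python
def matmul(x, w):
    """x: length-k vector; w: k x n matrix (list of rows) -> length-n vector."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

def split_columns(w, parts):
    """Split a k x n matrix into `parts` matrices of shape k x (n // parts)."""
    n = len(w[0]) // parts
    return [[row[i * n:(i + 1) * n] for row in w] for i in range(parts)]

def split_rows(w, parts):
    """Split a k x n matrix into `parts` matrices of shape (k // parts) x n."""
    k = len(w) // parts
    return [w[i * k:(i + 1) * k] for i in range(parts)]

# Two-layer MLP sharded across tp_degree "cores", Megatron-style:
# layer 1 is column-sharded (each core owns a slice of the hidden units),
# layer 2 is row-sharded, so partial outputs are summed (an all-reduce).
tp_degree = 2
x = [1.0, 2.0]
w1 = [[1.0, 0.0, 2.0, 1.0],
      [0.0, 1.0, 1.0, 2.0]]        # 2 x 4 first-layer weights
w2 = [[1.0], [1.0], [1.0], [1.0]]  # 4 x 1 second-layer weights

w1_shards = split_columns(w1, tp_degree)
w2_shards = split_rows(w2, tp_degree)

partials = []
for w1_i, w2_i in zip(w1_shards, w2_shards):
    h_i = matmul(x, w1_i)               # this core's slice of the hidden state
    partials.append(matmul(h_i, w2_i))  # this core's partial output

y = [sum(vals) for vals in zip(*partials)]  # all-reduce (sum) across cores
full = matmul(matmul(x, w1), w2)            # reference: unsharded compute
assert y == full
```

Column-sharding the first layer and row-sharding the second means each core ever holds only 1/tp_degree of each weight matrix, at the cost of one collective per layer pair, which is the same general trade-off used when sharding an LLM across Neuron cores.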

Ray on Amazon EC2 Trn1 instances, powered by AWS's purpose-built Trainium chips, will offer excellent price/performance for distributed training and fine-tuning of PyTorch models on AWS. Similarly, Amazon EC2 Inf2 instances, powered by AWS Inferentia, are purpose-built to provide high performance at low inference cost.

Machine learning models on AWS AI accelerators are deployed to containers using the AWS Neuron software development kit (SDK), which optimizes machine learning performance on Trainium- and Inferentia-based instances. With the Ray integration, users will be able to build low-latency, low-cost inference pipelines on Inferentia, using tensor parallelism through the Ray Serve API.
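The shape of such a pipeline can be mocked in plain Python. In the actual integration, each worker below would be a Ray actor or Serve replica scheduled onto Neuron cores (which Ray advertises as a `neuron_cores` resource), and each shard would be a Neuron-compiled model partition; the classes here are illustrative stand-ins, not Ray APIs:

```python
class ShardWorker:
    """Holds one column-shard of a weight matrix and computes a partial output.

    In Ray Serve this would be an actor scheduled with something like
    ray_actor_options={"resources": {"neuron_cores": 2}} (assumed here
    for illustration), with a Neuron-compiled shard instead of raw lists.
    """
    def __init__(self, weight_shard):
        self.weight_shard = weight_shard  # k x n_i slice owned by this worker

    def forward(self, x):
        # partial matmul: x (len k) times shard (k x n_i) -> len n_i
        return [sum(xi * wi for xi, wi in zip(x, col))
                for col in zip(*self.weight_shard)]

class ShardedDeployment:
    """Scatter a request to all shard workers, then gather and concatenate."""
    def __init__(self, weight, num_shards):
        cols = list(zip(*weight))          # transpose to a list of columns
        per = len(cols) // num_shards
        self.workers = [
            ShardWorker([list(row) for row in zip(*cols[i * per:(i + 1) * per])])
            for i in range(num_shards)
        ]

    def __call__(self, x):
        out = []
        for worker in self.workers:  # with Ray Serve these run in parallel
            out.extend(worker.forward(x))
        return out

weight = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
deployment = ShardedDeployment(weight, num_shards=2)
print(deployment([1.0, 1.0]))  # -> [6.0, 8.0, 10.0, 12.0], same as unsharded matmul
```

With Ray Serve, the loop over workers becomes parallel remote calls, and the deployment can be replicated and load-balanced behind an HTTP endpoint.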

This feature will be made available as part of the Ray 2.7.0 release and is made possible by integrating Ray with Transformers Neuron, an open source software package that enables users to perform Large Language Model inference on second-generation Neuron hardware. See the list of supported models available in Transformers Neuron.

In the Hugging Face repository, you'll find an example that compiles the OpenLLaMA-3B Large Language Model (LLM) and deploys it on an AWS Inferentia (Inf2) instance using Ray Serve. The example uses transformers-neuronx to shard the model across devices (Neuron cores) via tensor parallelism.

AWS Trainium integration with the high-level Ray Train API is currently in progress, and the latest updates can be tracked via this link.

Get started today by cloning this repo and running the example from your local machine!