This blog post was co-authored, and includes an introduction, by Zilong Bai, senior natural language processing engineer at Patsnap.

You're likely familiar with the autocomplete suggestion feature when you search for something on Google or Amazon. Although the search terms in these scenarios are pretty common keywords or expressions that we use in daily life, in some cases search terms are very specific to the scenario. Patent search is one of them. Recently, the AWS Generative AI Innovation Center collaborated with Patsnap to implement a feature that automatically suggests search keywords as an innovation exploration to improve user experiences on their platform.

Patsnap provides a global one-stop platform for patent search, analysis, and management. They use big data (such as a history of past search queries) to provide many powerful yet easy-to-use patent tools. These tools have enabled Patsnap's global customers to better understand patents, track recent technological advances, identify innovation trends, and analyze competitors in real time.

At the same time, Patsnap is embracing the power of machine learning (ML) to develop features that can continuously improve user experiences on the platform. A recent initiative aims to simplify the construction of search expressions by autofilling patent search queries using state-of-the-art text generation models. Patsnap had trained a customized GPT-2 model for this purpose. Because no such feature exists in a patent search engine (to the best of their knowledge), Patsnap believes adding it will increase end-user stickiness.

However, in their recent experiments, the inference latency and queries per second (QPS) of a PyTorch-based GPT-2 model couldn't meet certain thresholds that can justify its business value. To tackle this challenge, AWS Generative AI Innovation Center scientists explored a variety of solutions to optimize GPT-2 inference performance, lowering the model latency by 50% on average and improving QPS by 200%.

Large language model inference challenges and optimization approaches

In general, applying such a large model in a real-world production environment is non-trivial. The prohibitive computation cost and latency of PyTorch-based GPT-2 made it difficult to adopt widely from a business operation perspective. In this project, our objective is to significantly improve latency at a reasonable computation cost. Specifically, Patsnap requires the following:

- The average latency of model inference for generating search expressions needs to be controlled within 600 milliseconds in real-time search scenarios
- The model requires high throughput and QPS to handle a large number of searches per second during peak business hours

In this post, we discuss our findings using Amazon Elastic Compute Cloud (Amazon EC2) instances, featuring GPU-based instances using NVIDIA TensorRT. In short, we use NVIDIA TensorRT to optimize the latency of GPT-2 and deploy it to an Amazon SageMaker endpoint for model serving, which reduces the average latency from 1,172 milliseconds to 531 milliseconds. In the following sections, we go over the technical details of the proposed solutions with key code snippets and show comparisons with the customer's status quo based on key metrics.

GPT-2 model overview

OpenAI's GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on the WebText dataset, which contains 8 million web pages.
GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and let it generate a lengthy continuation. In this situation, we exploit it to generate search queries. As GPT models keep growing larger, inference costs rise continuously, which increases the need to deploy these models at an acceptable cost.

Achieve low latency on GPU instances via TensorRT

TensorRT is a C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators, supporting major deep learning frameworks such as PyTorch and TensorFlow. Previous studies have shown great performance improvements in terms of model latency. Therefore, it's an ideal choice for us to reduce the latency of the target model on NVIDIA GPUs.

We are able to achieve a significant reduction in GPT-2 model inference latency with a TensorRT-based model on NVIDIA GPUs. The TensorRT-based model is deployed via SageMaker for performance tests. In this post, we show the steps to convert the original PyTorch-based GPT-2 model to a TensorRT-based model.

Converting the PyTorch-based GPT-2 to the TensorRT-based model is not difficult via the official tool provided by NVIDIA. In addition, with such straightforward conversions, no obvious model accuracy degradation has been observed. In general, there are three steps to follow:

1. Analyze your GPT-2. As of this writing, NVIDIA's conversion tool only supports the Hugging Face version of the GPT-2 model. If the current GPT-2 model isn't the original version, you need to modify it accordingly. It's recommended to strip out custom code from Hugging Face's original GPT-2 implementation, which is very helpful for the conversion.
2. Install the required Python packages. The conversion process first converts the PyTorch-based model to an ONNX model and then converts the ONNX-based model to the TensorRT-based model. The following Python packages are needed for this two-step conversion:
   tabulate
   toml
   torch
   sentencepiece==0.1.95
   onnx==1.9.0
   onnx_graphsurgeon
   polygraphy
   transformers
3. Convert your model. The following code contains the functions for the two-step conversion:
# Helper classes (NetworkMetadata, Precision, GPT2Metadata, GPT2TorchFile,
# GPT2ONNXFile, GPT2TRTDecoder, Profile) come from NVIDIA's official
# conversion tool (the TensorRT Hugging Face demo) mentioned above.

def torch2onnx():
    # Describe the network: GPT-2 variant, FP16 precision, no KV cache
    metadata = NetworkMetadata(variant=GPT2_VARIANT,
                               precision=Precision(fp16=True),
                               other=GPT2Metadata(kv_cache=False))
    gpt2 = GPT2TorchFile(model.to('cpu'), metadata)
    onnx_path = 'Your own path to save ONNX-based model'  # e.g., ./model_fp16.onnx
    # Export the PyTorch model to ONNX
    gpt2.as_onnx_model(onnx_path, force_overwrite=False)
    return onnx_path, metadata


def onnx2trt(onnx_path, metadata):
    trt_path = 'Your own path to save TensorRT-based model'  # e.g., ./model_fp16.onnx.engine
    batch_size = 10
    max_sequence_length = 42
    # Optimization profile for the dynamic input_ids shape: (batch size, sequence length)
    profiles = [Profile().add(
        "input_ids",
        min=(1, 1),
        opt=(batch_size, max_sequence_length // 2),
        max=(batch_size, max_sequence_length),
    )]
    # Build the TensorRT engine from the ONNX model
    gpt2_engine = GPT2ONNXFile(onnx_path, metadata).as_trt_engine(output_fpath=trt_path,
                                                                  profiles=profiles)
    # Wrap the engine in a decoder with a Hugging Face-style interface
    gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config,
                              max_sequence_length=42, batch_size=10)
    return gpt2_trt  # returned so the decoder can be used downstream
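The conversion functions above assume that the fine-tuned Hugging Face GPT-2 model (model), its configuration (config), and the variant name (GPT2_VARIANT) are already defined in the surrounding script. The following is a minimal sketch of how the pieces might be wired together end to end; the checkpoint name, prompt, and generation length are illustrative placeholders rather than Patsnap's actual values, and it assumes the TensorRT-based decoder exposes the familiar Hugging Face generate() interface, as in NVIDIA's demo.

from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical checkpoint name; replace with the path of your own fine-tuned GPT-2
checkpoint = 'gpt2'
model = GPT2LMHeadModel.from_pretrained(checkpoint)
config = GPT2Config.from_pretrained(checkpoint)
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
GPT2_VARIANT = 'gpt2'  # variant name expected by the NVIDIA conversion classes

# Two-step conversion: PyTorch -> ONNX -> TensorRT engine
onnx_path, metadata = torch2onnx()
gpt2_trt = onnx2trt(onnx_path, metadata)

# Prime the TensorRT-based decoder with a partial query and let it complete it
# (assumes the decoder supports the Hugging Face generate() API, as in NVIDIA's demo)
input_ids = tokenizer('wireless charging', return_tensors='pt').input_ids.to('cuda')
output_ids = gpt2_trt.generate(input_ids, max_length=42)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

The batch_size=10 and max_sequence_length=42 values in the optimization profile bound the input shapes the engine is tuned for, reflecting the short search-expression prompts and modest batch sizes targeted in this scenario.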
Latency comparison: PyTorch vs. TensorRT

JMeter is used for performance benchmarking in this project. JMeter is an Apache project that can be used as a load testing tool for analyzing and measuring the performance of a variety of services. We record the QPS and latency of the original PyTorch-based model and our converted TensorRT-based GPT-2 model on an AWS p3.2xlarge instance. As we show later in this post, due to the powerful acceleration ability of TensorRT, the latency of GPT-2 is significantly reduced. When the request concurrency is 1, the average latency is reduced by 274 milliseconds (2.9 times faster). From the perspective of QPS, it increases to 7 from 2.4, which is around a 2.9 times boost compared to the original PyTorch-based model. Moreover, as the concurrency increases, QPS keeps increasing. This suggests lower costs with an acceptable latency increase (but still much faster than the original model).

The following table compares latency (all latency values are in milliseconds):

| Version | Concurrency | QPS | Maximum latency | Minimum latency | Average latency |
| --- | --- | --- | --- | --- | --- |
| Customer PyTorch version (on p3.2xlarge) | 1 | 2.4 | 632 | 105 | 417 |
| | 2 | 3.1 | 919 | 168 | 636 |
| | 3 | 3.4 | 1911 | 222 | 890 |
| | 4 | 3.4 | 2458 | 277 | 1172 |
| AWS TensorRT version (on p3.2xlarge) | 1 | 7 (+4.6) | 275 | 22 | 143 (-274 ms) |
| | 2 | 7.2 (+4.1) | 274 | 51 | 361 (-275 ms) |
| | 3 | 7.3 (+3.9) | 548 | 49 | 404 (-486 ms) |
| | 4 | 7.5 (+4.1) | 765 | 62 | 531 (-641 ms) |

Deploy TensorRT-based GPT-2 with SageMaker and a custom container

TensorRT-based GPT-2 requires a relatively recent TensorRT version, so we choose the bring your own container (BYOC) mode of SageMaker to deploy our model. BYOC mode provides a flexible way to deploy the model, and you can build customized environments in your own Docker container. In this section, we show how to build your own container, deploy your own GPT-2 model, and test with the SageMaker endpoint API.

Build your own container

The container's file directory is presented in the following code. Specifically, Dockerfile and build.sh are used to build the Docker container.

gpt2