Microsoft improved Bing with new large language models (LLMs) and small language models (SLMs), which the company says helped to reduce latency and cost associated with hosting and running search.
"Leveraging both Large Language Models (LLMs) and Small Language Models (SLMs) marks a significant milestone in enhancing our search capabilities," the company wrote in a blog post. "While transformer models have served us well, the growing complexity of search queries necessitated more powerful models."
Microsoft trained SLMs to process and understand search queries more precisely. But one of the key challenges with large models is managing latency and cost, so the company integrated Nvidia's TensorRT-LLM into its workflow to optimize SLM inference performance.
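For illustration, here is a minimal sketch of what batched inference through TensorRT-LLM's high-level Python API can look like. The model name, decoding settings, and example queries are hypothetical placeholders, not Bing's actual models or pipeline.

```python
# Sketch: batched inference with TensorRT-LLM's high-level LLM API.
# The model and queries below are illustrative assumptions only.
from tensorrt_llm import LLM, SamplingParams

# Load a (hypothetical) small language model as a TensorRT-LLM engine.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Conservative decoding settings for a search-style task.
params = SamplingParams(temperature=0.2, max_tokens=128)

# A batch of search queries processed together in one inference call.
queries = [
    "best hiking trails near seattle",
    "how to fix a leaking kitchen faucet",
]

for output in llm.generate(queries, params):
    print(output.outputs[0].text)
```

Batching queries this way is what makes the per-batch latency and queries-per-second figures below meaningful: the engine amortizes model execution across all requests in the batch.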
One product in which Microsoft uses TensorRT-LLM is "Deep Search," which aims to provide the best possible web results to Bing users.
The original Transformer model, Microsoft explains, "had a 95th percentile latency of 4.76 seconds per batch and a throughput of 4.2 queries per second per instance."
Each batch consisted of 20 queries, that is, 20 questions or requests processed together by the model.
After integrating TensorRT-LLM, Microsoft managed to achieve a 95th percentile latency reduction to 3.03 seconds per batch and increased throughput to 6.6 queries per second per instance.
Latency is the time it takes for the model to process a request and return a response. The 95th percentile figure means that 95% of batches were processed in 3.03 seconds or less, which significantly speeds up response times.
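As a concrete illustration (not Microsoft's code), both quoted metrics can be computed from per-batch timings like this; the latency values here are made up:

```python
# Sketch: computing p95 latency and per-instance throughput from
# per-batch measurements. Timings are illustrative, not real data.
import statistics

batch_size = 20  # queries per batch, per the article

# Measured wall-clock time per batch, in seconds (fabricated example values).
batch_latencies = [2.9, 3.0, 2.8, 3.1, 2.7, 3.0, 2.9, 3.2, 2.8, 3.0]

# 95th percentile: 95% of batches finish at or under this latency.
p95 = statistics.quantiles(batch_latencies, n=100)[94]

# Throughput: queries completed per second by a single instance.
throughput = batch_size / statistics.mean(batch_latencies)

print(f"p95 latency: {p95:.2f} s/batch")
print(f"throughput: {throughput:.1f} queries/s per instance")
```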
This optimization enhanced the user experience by delivering quicker search results and reduced the operational costs of running these large models by 57%.
Microsoft said the transition to SLMs and the integration of TensorRT-LLM brought: