
Microsoft has upgraded Bing with new large language models (LLMs) and small language models (SLMs), which the company says reduced the latency and cost of hosting and running search.
"Leveraging both Large Language Models (LLMs) and Small Language Models (SLMs) marks a significant milestone in enhancing our search
capabilities," the company wrote in a blog post. "While transformer models have served us well, the growing complexity of search queries necessitated more powerful models."
Microsoft trained SLMs to process and understand search queries more precisely. A key challenge with large models, however, is managing latency and cost, so Microsoft integrated Nvidia's TensorRT-LLM into its workflow to optimize SLM inference performance.
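Bing's actual models and serving stack are not public, but TensorRT-LLM does ship a high-level Python `LLM` API that illustrates the kind of optimized batched inference the blog post describes. The following is a minimal sketch under assumptions: the model name is a placeholder open model standing in for Bing's undisclosed SLM, and the prompts are invented.

```python
# Minimal TensorRT-LLM inference sketch (assumed setup, not Microsoft's code).
# Requires the tensorrt_llm package and a supported Nvidia GPU.
from tensorrt_llm import LLM, SamplingParams

# Placeholder open model; Bing's actual SLM has not been disclosed.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# A batch of queries submitted together, as in the article's 20-query batches.
prompts = [
    "What is the capital of France?",
    "Explain TensorRT-LLM in one sentence.",
]
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# LLM() builds an optimized TensorRT engine for the model; generate() then
# runs batched inference against that engine.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```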
One product in which Microsoft uses TensorRT-LLM is "Deep Search," which aims to provide the best possible web results to Bing users.
Microsoft bought nearly half a million graphics processing units (GPUs) this year to build artificial intelligence (AI) systems, the Financial Times reported. The FT cited analysts at Omdia, a technology consultancy, who estimate that Microsoft bought 485,000 of Nvidia's "Hopper" chips this year, outbuying even Meta, Nvidia's next-biggest customer, which bought 224,000.
The original Transformer model, Microsoft explains, "had a 95th percentile latency of 4.76 seconds per batch and a throughput of 4.2 queries per second per instance." Each batch consisted of 20 queries, meaning 20 questions or requests processed together by the model.
After integrating TensorRT-LLM, Microsoft reduced 95th percentile latency to 3.03 seconds per batch and increased throughput to 6.6 queries per second per instance.
Latency is the time it takes the model to process a request and return a response. The 95th percentile figure means that 95% of batches were processed in 3.03 seconds or less, a significant speedup in response time.
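As a concrete illustration of these two metrics, the short sketch below computes a p95 latency and a per-instance throughput from a set of batch timings. The timings are made up for the example, not Microsoft's data:

```python
import numpy as np

# Hypothetical per-batch processing times in seconds (illustrative only);
# each batch holds 20 queries, as in the article.
batch_times = np.array([2.1, 2.8, 3.0, 2.5, 3.0, 2.9, 2.2, 2.7])

# p95 latency: 95% of batches finish within this many seconds.
p95_latency = np.percentile(batch_times, 95)

# Per-instance throughput: total queries served divided by total wall-clock
# time, assuming batches run back to back on a single instance.
queries_per_batch = 20
throughput = (len(batch_times) * queries_per_batch) / batch_times.sum()

print(f"p95 latency: {p95_latency:.2f} s per batch")
print(f"throughput:  {throughput:.1f} queries/s per instance")
```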
This optimization delivers quicker search results to users and reduces the operational cost of running these models by 57%.
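To see how a throughput gain translates into cost, consider a rough capacity calculation. The load figure below is hypothetical, and the sketch models only the instance-count saving from higher throughput (about 36%); the 57% reduction Microsoft reports presumably also reflects other factors, such as the smaller SLM footprint, that this arithmetic does not capture:

```python
import math

# Illustrative capacity math using the article's throughput figures.
target_qps = 1000.0            # hypothetical fleet-wide query load
old_tput, new_tput = 4.2, 6.6  # queries/s per instance, before and after

old_instances = math.ceil(target_qps / old_tput)  # 239 instances
new_instances = math.ceil(target_qps / new_tput)  # 152 instances

savings = 1 - new_instances / old_instances
print(f"instances: {old_instances} -> {new_instances} ({savings:.0%} fewer)")
```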
Microsoft said the transition to SLMs and the integration of TensorRT-LLM brought:
- Faster search results: With optimized inference, users get quicker response times, making the search experience more seamless and efficient.
- Improved accuracy: The enhanced capabilities of SLMs allow Microsoft to deliver more accurate and contextualized search results, helping users find the information they need more effectively.
- Cost efficiency: By reducing the cost of hosting and running large models, Microsoft can continue to invest in further innovations and improvements, ensuring that Bing remains at the forefront of search technology.