Microsoft improved Bing with new large language models (LLMs) and small language models (SLMs), which the company says helped to reduce latency and cost associated with hosting and running search.
"Leveraging both Large Language Models (LLMs) and Small Language Models (SLMs) marks a significant milestone in enhancing our search capabilities," the company wrote in a blog post. "While transformer models have served us well, the growing complexity of search queries necessitated more powerful models."
Microsoft trained SLMs to process and understand search queries more precisely. But one of the key challenges with large models is managing latency and cost, so the company integrated Nvidia's TensorRT-LLM into its workflow to optimize SLM inference performance.
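For illustration, here is a minimal sketch of what batched inference through TensorRT-LLM's high-level Python API can look like. The model name, decoding settings, and example queries are hypothetical placeholders, not Bing's actual models or pipeline.

```python
# Sketch: batched inference with TensorRT-LLM's high-level LLM API.
# The model and queries below are illustrative assumptions only.
from tensorrt_llm import LLM, SamplingParams

# Load a (hypothetical) small language model as a TensorRT-LLM engine.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Conservative decoding settings for a search-style task.
params = SamplingParams(temperature=0.2, max_tokens=128)

# A batch of search queries processed together in one inference call.
queries = [
    "best hiking trails near seattle",
    "how to fix a leaking kitchen faucet",
]

for output in llm.generate(queries, params):
    print(output.outputs[0].text)
```

Batching queries this way is what makes the per-batch latency and queries-per-second figures below meaningful: the engine amortizes model execution across all requests in the batch.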
One product in which Microsoft uses TensorRT-LLM is "Deep Search," which aims to provide the best possible web results to Bing users.
The original Transformer model, Microsoft explains, "had a 95th percentile latency of 4.76 seconds per batch and a throughput of 4.2 queries per second per instance."
Each batch consisted of 20 queries, that is, 20 questions or requests processed together by the model.
After integrating TensorRT-LLM, Microsoft managed to achieve a 95th percentile latency reduction to 3.03 seconds per batch and increased throughput to 6.6 queries per second per instance.
Latency is the time it takes for the model to process a request and return a response. The 95th percentile figure means that 95% of batches were processed in 3.03 seconds or less, which significantly speeds up response times.
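As a concrete illustration (not Microsoft's code), both quoted metrics can be computed from per-batch timings like this; the latency values here are made up:

```python
# Sketch: computing p95 latency and per-instance throughput from
# per-batch measurements. Timings are illustrative, not real data.
import statistics

batch_size = 20  # queries per batch, per the article

# Measured wall-clock time per batch, in seconds (fabricated example values).
batch_latencies = [2.9, 3.0, 2.8, 3.1, 2.7, 3.0, 2.9, 3.2, 2.8, 3.0]

# 95th percentile: 95% of batches finish at or under this latency.
p95 = statistics.quantiles(batch_latencies, n=100)[94]

# Throughput: queries completed per second by a single instance.
throughput = batch_size / statistics.mean(batch_latencies)

print(f"p95 latency: {p95:.2f} s/batch")
print(f"throughput: {throughput:.1f} queries/s per instance")
```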
This optimization enhanced the user experience by delivering quicker search results and reduced the operational costs of running these large models by 57%.
Microsoft said the transition to SLMs and the integration of TensorRT-LLM brought: