Commentary

Search Engine Bridges Language Barrier

by Laurie Sullivan, Staff Writer, November 17, 2010


Georgia Tech researchers have created a machine-learning model that enables Web sites to learn dialect and other vernacular, improving search experiences and performance when the language of a query is "unclear or unorthodox."

The algorithm, which focuses on medical terms today, supports only three dialects -- informal, formal and technical -- but could scale to dozens with a little work.

The technology takes longer search queries -- about 10 to 20 words -- in English and attempts to understand the topic, matching it with the more technical language found in medical documents. It doesn't need to understand the language, because it focuses on the words, not the word order. It won't work well with a two- or three-word query.
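The focus on words rather than word order is what is usually called a bag-of-words representation. The Python sketch below is only an illustration of that idea, not diaTM's code: the query, the two toy documents, and the helper names `bag_of_words` and `cosine_similarity` are all invented for this example.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Lowercase the text and count each word; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Cosine between two word-count vectors (0.0 if either one is empty)."""
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data, invented for illustration only.
query = "my eye has been red and itchy with yellow gunk in the morning"
shuffled = "gunk yellow with itchy and red been has eye my morning the in"
documents = {
    "informal": "pink eye makes your eye red and itchy with gunk in the morning",
    "technical": "conjunctivitis presents with ocular redness pruritus "
                 "and purulent discharge",
}

# Reordering the query changes nothing once it is reduced to word counts.
same_bag = bag_of_words(query) == bag_of_words(shuffled)

# Score each document against the query by plain word overlap.
scores = {
    name: cosine_similarity(bag_of_words(query), bag_of_words(text))
    for name, text in documents.items()
}
```

In this toy setup, plain word matching scores the informal document well above the technical one, even though both describe the same condition. That vocabulary gap is the problem the article says diaTM is meant to close.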

The tool, dubbed diaTM for dialect topic modeling, learns by comparing multiple medical documents written in different levels of technical language. Comparing multiple documents allows the algorithm to identify the medical conditions, symptoms and procedures associated with certain dialectal words or phrases.

Steven Crain, a Georgia Tech Ph.D. student in computer science and lead author of the paper that describes diaTM, based the algorithm on Latent Dirichlet Allocation (LDA) and extended it to account for people using a different vocabulary from that of the documents they are trying to retrieve. "We're trying to keep track of the difference, rather than hoping it works out in the end," he says.
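For readers unfamiliar with LDA, the Python sketch below walks through its standard generative story: each document gets its own mixture of topics, and each word is produced by first picking a topic from that mixture. The dialect switch is an assumption added purely for illustration, since the article does not describe how diaTM actually models dialects. The word tables, topic names, and function names are hypothetical, and each topic's words are drawn uniformly from a short list rather than from a learned distribution.

```python
import random

# Hypothetical word tables: WORDS[topic][dialect] -> words used for that topic.
WORDS = {
    "eye infection": {
        "informal": ["gunk", "pink", "itchy", "eye"],
        "technical": ["discharge", "conjunctivitis", "pruritus", "ocular"],
    },
    "heart attack": {
        "informal": ["chest", "pain", "heart", "attack"],
        "technical": ["myocardial", "infarction", "angina", "cardiac"],
    },
}


def sample_dirichlet(alpha, rng):
    """Draw one point from a Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(dialect, n_words, rng, alpha=0.5):
    """Generate a toy document: topic mixture first, then one topic per word."""
    topics = list(WORDS)
    # Per-document topic mixture, as in standard LDA.
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word according to the document's mixture.
        topic = rng.choices(topics, weights=theta, k=1)[0]
        # Assumed dialect step: the same topic emits different words per dialect.
        words.append(rng.choice(WORDS[topic][dialect]))
    return theta, words


rng = random.Random(0)
theta_informal, informal_doc = generate_document("informal", 12, rng)
theta_technical, technical_doc = generate_document("technical", 12, rng)
```

Training runs this story in reverse: given only the words, the model infers which topics, and in a dialect-aware variant which dialects, most plausibly produced them.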

The search platform learns how different words in the dialects -- informal and formal -- relate to topics. The search engine learns from all those who search on the site. Think crowdsourcing, but Crain says site visitors could eventually personalize the experience by pulling from a profile or past browsing history. The tool doesn't offer anything like this today.

Educating diaTM meant Crain and his fellow researchers had to pull publicly available documents not only from WebMD but also from Yahoo Answers, PubMed Central, the Centers for Disease Control & Prevention website, and other sources. After processing enough documents, diaTM learned that the word "gunk" often meant "discharge."

Although the tests run by the Georgia Tech doctoral students focused on medical terminology, the algorithms could support a variety of topics. People who live in an Atlanta ghetto speak differently than those who live in rural Michigan or in Beverly Hills, Calif.

Crain says the team is considering making a prototype standalone search engine. By itself it probably wouldn't work well, but combined with existing search algorithms it would become much stronger. "This system would work well if a person had a need for information that's difficult to find in engines like Google or Bing," he says.

Google or Bing could integrate diaTM into their respective search technology to improve longer query searches and certain types of unusual queries, provided the engines had good training documents for the topic, Crain says.

Crain says he developed the technology to help consumers gain more control of their health by finding information about how to care for themselves and others.


About the Author

Laurie Sullivan is a writer and editor for MediaPost. You can reach Laurie at lauriesullivan@gmail.com.