Microsoft is making available a data set from Bing search and Cortana virtual assistant queries to researchers who want to train their artificial intelligence systems. The data set -- called Microsoft Machine Reading Comprehension or MS MARCO -- is an anonymized data set based on real queries typed into the Bing search engine. The goal is to help AI platforms understand questions in a conversational tone.
The white paper, released in late November by the Microsoft AI and Research team, outlines ways to "overcome a number of well-known weaknesses of previous publicly available datasets for the same task of reading comprehension and question answering." Answers to the queries are generated by humans, and a subset of the queries has multiple answers.
For every question in the dataset, researchers asked crowdsourced workers to answer it if they could and to mark the relevant passages that provide supporting information for the answer. If the workers couldn’t answer a question, the researchers considered it unanswerable, and a sample of those unanswerable questions is included in MS MARCO.
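For readers who want to picture what one entry in such a dataset might look like, here is a minimal Python sketch. It is an illustration based only on the description above: the class and field names (`QueryRecord`, `Passage`, `is_supporting`) and the placeholder strings are hypothetical and are not taken from the schema Microsoft released.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Passage:
    """A candidate passage shown to the worker (hypothetical structure)."""
    text: str
    # True when the worker marked this passage as supporting the answer.
    is_supporting: bool = False


@dataclass
class QueryRecord:
    """One query with its passages and human-written answers (hypothetical)."""
    query: str
    passages: List[Passage] = field(default_factory=list)
    # Zero answers means the worker could not answer the question;
    # more than one entry models queries that have multiple answers.
    answers: List[str] = field(default_factory=list)

    @property
    def is_answerable(self) -> bool:
        return len(self.answers) > 0

    def supporting_passages(self) -> List[Passage]:
        return [p for p in self.passages if p.is_supporting]


# Placeholder records, not real MS MARCO data.
answered = QueryRecord(
    query="example question one",
    passages=[Passage("passage A", is_supporting=True), Passage("passage B")],
    answers=["example answer"],
)
unanswered = QueryRecord(
    query="example question two",
    passages=[Passage("passage C")],
)
```

In this sketch an unanswerable question is simply a record with an empty answer list, which keeps answered and unanswered queries in the same structure.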
Last week, Microsoft Ventures, the company's VC arm, announced a new fund for AI startups. The fund already backs Element AI, a Montreal-based startup. The company is working to build AI systems, and it works with local startups trying to apply neural networks for commercial use.
Researchers plan to release one million queries and corresponding answers in the dataset. "We are currently releasing 100,000 queries with their corresponding answers to inspire work in reading comprehension and question answering along with gathering feedback from the research community," researchers wrote in the white paper.
The long-term goal of the research is to develop more advanced data sets to assess and facilitate research toward real, human-like reading comprehension.