
Yahoo wants to give researchers more insight into how consumers interact with news about the latest Lexus RS 2016 model, Anaheim Ducks hockey team, Spalding basketballs, Star
Wars, Microsoft's transition to Windows 10 and updates to Bing, Internet-connected devices, and Google self-driving cars, among a plethora of other things. So the company is making 13.5 terabytes of
uncompressed data available for use and review.
The dataset contains more than 119 billion events consisting of interactions with news items. Yahoo
says the release, part of the Yahoo Labs Webscope data sharing program, aims to spur innovation.
The program is a reference library of scientifically useful datasets for non-commercial use by
academics and other scientists. Suju Rajan, director of research for personalization science at Yahoo Labs, believes the release will help to promote independent research in the fields of
large-scale machine learning and recommendation systems, and will level the playing field between industrial and academic research.
"In our age of Big Data, 13.5 terabytes as an abstract
quantity isn't particularly noteworthy," says Scott Brinker, founder of ION Interactive. "My 13-inch MacBook Pro laptop came with a 1 TB drive. This specific set of data Yahoo's releasing is exciting
because it's the largest set of data on user behaviors that a company like Yahoo has ever released for academic research."
The sample, announced Thursday, is based on anonymized user
interactions with news feeds on the Yahoo home page, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. It was collected by recording the user-item interactions of about 20
million users from February to May 2015.
The dataset
provides demographic information in categories such as age, geographic area, and gender. The title, summary, and key phrases of the news articles read are included, and interaction data is stamped with
the user's local time and partial information about the device used to access the news feeds.
Yahoo describes this as the public release of the largest-ever machine learning dataset to
the academic research community. With this release, the company aims to advance the field of large-scale machine learning and recommender systems, and to help level the playing field between
industrial and academic research.
User behavioral data is key. To wrap your head around the sheer quantity of user data, 1 byte is equal to 9.09494701773E-13 terabyte, or
9.31322574615E-10 gigabyte, and 13.5 terabytes is equal to 13,824 gigabytes, according to a conversion calculator.
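
For readers who want to check these conversions themselves, here is a minimal Python sketch. It assumes the binary convention (1 gigabyte = 1024^3 bytes, 1 terabyte = 1024^4 bytes), which is the convention the conversion figures quoted here are consistent with. The function names are illustrative and not part of any Yahoo tool or dataset.

```python
# Binary (1024-based) unit sizes. This convention is an assumption,
# chosen because it reproduces the conversion figures quoted in the article.
BYTES_PER_GB = 1024 ** 3
BYTES_PER_TB = 1024 ** 4


def bytes_to_gb(n_bytes):
    """Convert a byte count to gigabytes."""
    return n_bytes / BYTES_PER_GB


def bytes_to_tb(n_bytes):
    """Convert a byte count to terabytes."""
    return n_bytes / BYTES_PER_TB


def tb_to_gb(terabytes):
    """Convert terabytes to gigabytes."""
    return terabytes * 1024


if __name__ == "__main__":
    print(f"1 byte  = {bytes_to_tb(1):.11E} terabyte")   # 9.09494701773E-13
    print(f"1 byte  = {bytes_to_gb(1):.11E} gigabyte")   # 9.31322574615E-10
    print(f"13.5 TB = {tb_to_gb(13.5):,.0f} gigabytes")  # 13,824
```

Under the decimal convention (1 terabyte = 1,000 gigabytes), the same release would instead come to 13,500 gigabytes, so the result depends on which definition a given calculator uses.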