Commentary

Yahoo Releases 13.5 Terabytes Of User-Behavior Data To Researchers

Yahoo wants to give researchers more insight into how consumers interact with news about the latest Lexus RS 2016 model, the Anaheim Ducks hockey team, Spalding basketballs, Star Wars, Microsoft's transition to Windows 10 and updates to Bing, Internet-connected devices, and Google self-driving cars, among a plethora of other things. So the company is making 13.5 terabytes of uncompressed data available for use and review.

The release amounts to more than 119 billion events, each an interaction with a news item. Yahoo says it aims to spur innovation as part of the Yahoo Labs Webscope data-sharing program.

The program is a reference library of scientifically useful datasets for non-commercial use by academics and other scientists. Suju Rajan, director of research for personalization science at Yahoo Labs, believes the release will help promote independent research in large-scale machine learning and recommendation systems, and will level the playing field between industrial and academic research.

"In our age of Big Data, 13.5 terabytes as an abstract quantity isn't particularly noteworthy," says Scott Brinker, founder of ION Interactive. "My 13-inch MacBook Pro laptop came with a 1 TB drive. This specific set of data Yahoo's releasing is exciting because it's the largest set of data on user behaviors that a company like Yahoo has ever released for academic research."

The sample, announced Thursday, is based on anonymized user interactions with news feeds on the Yahoo home page, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. It was collected by recording the user-item interactions of about 20 million users from February to May 2015.

The dataset provides demographic information in categories such as age, geographic area, and gender. The title, summary, and key phrases of the news articles read are included, and interaction data is stamped with the user's local time and partial information about the device used to access the news feeds.

Yahoo describes this as the largest-ever public release of a machine learning dataset to the academic research community. With this release, the company aims to advance the field of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research.

User behavioral data is key. To wrap your head around the quantity of user data: 1 byte is equal to 9.09494701773E-13 terabyte, or 9.31322574615E-10 gigabyte, so 13.5 terabytes works out to 13,824 gigabytes, according to a conversion calculator.
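
For readers who want to check the arithmetic, here is a minimal sketch of the conversion. It assumes the calculator used binary units (1 terabyte = 1,024 gigabytes = 1,024^4 bytes), which is what the per-byte factors quoted above imply. The constant and function names are illustrative only and are not taken from Yahoo or from any particular calculator.

```python
# Assumption: binary (1,024-based) units, matching the per-byte factors quoted above.
BYTES_PER_GIGABYTE = 1024 ** 3
BYTES_PER_TERABYTE = 1024 ** 4


def bytes_to_gigabytes(n_bytes: float) -> float:
    """Convert a byte count to binary gigabytes."""
    return n_bytes / BYTES_PER_GIGABYTE


def bytes_to_terabytes(n_bytes: float) -> float:
    """Convert a byte count to binary terabytes."""
    return n_bytes / BYTES_PER_TERABYTE


def terabytes_to_gigabytes(terabytes: float) -> float:
    """Convert binary terabytes to binary gigabytes."""
    return terabytes * BYTES_PER_TERABYTE / BYTES_PER_GIGABYTE


if __name__ == "__main__":
    print(bytes_to_terabytes(1))         # fraction of a terabyte in one byte
    print(bytes_to_gigabytes(1))         # fraction of a gigabyte in one byte
    print(terabytes_to_gigabytes(13.5))  # size of the release in gigabytes
```

With decimal units (1 terabyte = 1,000 gigabytes) the same release would come to 13,500 gigabytes, so the 13,824 figure depends on the binary convention.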

1 comment about "Yahoo Releases 13.5 Terabytes Of User-Behavior Data To Researchers".
  1. Steve Baldwin from Didit, January 14, 2016 at 4:34 p.m.

    My hope is that none of this data leaks, which is what happened when AOL released a large volume of search data in August 2006 (see: https://en.wikipedia.org/wiki/AOL_search_data_leak).
