IBM has created a search tool that allows North Carolina State University to crawl through massive amounts of Web data on blogs, forums, reports, industry-related news portals and government Web sites. Like a search engine bot, the tool gathers data, then produces a short list of potential investors for projects.
NC State's Office of Technology Transfer manages more than 3,000 technologies invented by students, faculty and staff. The seven-member staff typically searches the Internet manually, looking for potential investors to help bring technologies to market.
"The analytics and the language tools take a user defined set of criteria and searches the Web," explains Billy Houghteling, director of NC State's Office of Technology Transfer. "For both pilots, we identified the sites and resources the tools needed to crawl. Both searched more than 1.4 million sites to find contacts. I can't fathom how long it would take for a member of my staff to do that type of exercise."
Historically, it would take between two and four months to identify a short list of potential investors. IBM's newly defined analytics "search engine" cuts that down to between 10 days and a couple of weeks. The analytics tools not only validated the process but also identified many new possible partners.
Developed in IBM Labs, the analytics technology used in the pilot (BigSheets and Content Analyzer in the IBM Cognos analytics suite) crawls the Web and mines large amounts of unstructured data. The analysis, based on factors such as business relevancy, government policies, market needs and trends, shortens a time-intensive and inefficient process.
BigSheets, built on Hadoop technology, supports high-level ad hoc exploration of very large data sets, while Content Analyzer provides sophisticated data analysis. Both tools offer a Web-based interface, but BigSheets also provides a visualization feature that highlights relationships within the data. Simply put, BigSheets keeps the data in its original format; Content Analyzer creates a data index while it scans the information.
Those using BigSheets point the tool toward the Web sites or data sources they want to mine and allow the application to collect the information. They can then explore the data much as they would in a spreadsheet. Both tools crawl Web sites to find and collect data, but Content Analyzer indexes the data, while BigSheets parses it and stores the bits and bytes in their original form.
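The difference between keeping data in its original form and indexing it during the scan can be illustrated with a toy sketch. This is not IBM's implementation of either product; the class names `RawStore` and `IndexedStore`, the `tokenize` helper and the example URLs are hypothetical and exist only to show the trade-off.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase runs of letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    return TOKEN.findall(text.lower())


class RawStore:
    """Keeps every page exactly as collected; work happens at query time."""

    def __init__(self) -> None:
        self.docs: dict[str, bytes] = {}

    def ingest(self, url: str, body: bytes) -> None:
        # Store the original bytes untouched.
        self.docs[url] = body

    def search(self, term: str) -> list[str]:
        # Every stored page is decoded and scanned on each query.
        needle = term.lower()
        return sorted(
            url
            for url, body in self.docs.items()
            if needle in tokenize(body.decode("utf-8", errors="replace"))
        )


class IndexedStore:
    """Builds a term-to-pages index while scanning; queries are lookups."""

    def __init__(self) -> None:
        self.index: dict[str, set[str]] = defaultdict(set)

    def ingest(self, url: str, body: bytes) -> None:
        # The index is updated as each page is scanned.
        for token in tokenize(body.decode("utf-8", errors="replace")):
            self.index[token].add(url)

    def search(self, term: str) -> list[str]:
        # .get avoids creating empty entries for unknown terms.
        return sorted(self.index.get(term.lower(), set()))
```

In this sketch both stores return the same pages for a search term, but the raw store can still hand back the original bytes for later ad hoc exploration, while the indexed store answers repeated queries without rescanning every page.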
Chris Spencer, emerging technology strategist at IBM, made it clear that both tools follow the search and index guidelines set out in each site's robots.txt directives, so the tools are "friendly" crawlers that follow the rules set by Web site owners.
NC State has full use of the tool while evaluating it, Spencer says. At the request of NC State, IBM continues to work with the university to determine other uses beyond the requirements of the Office of Technology Transfer.
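As a rough illustration of what honoring robots.txt involves, the sketch below uses Python's standard `urllib.robotparser` module to filter a list of URLs before fetching them. It is not how the IBM tools are built; the crawler name `example-crawler`, the sample rules and the URLs are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs the site's rules permit this crawler to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/news/index.html",
    "https://example.com/private/report.html",
]
permitted = allowed_urls(parser, "example-crawler", candidates)
# Seconds the site asks crawlers to wait between requests, if specified.
delay = parser.crawl_delay("example-crawler")
```

A crawler built this way would fetch only the entries in `permitted` and pause for `delay` seconds between requests to the same site.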