News Sites Push For More Robots.txt Control

Most webmasters learn about the functionality of the robots.txt file very quickly, since incorrect use of the protocol can make it impossible for the search engines to crawl and index a site. But no trade organization backs the Robots Exclusion Protocol -- so technically it's not really a standard, just a format that the engines and site owners have agreed upon. Until now.
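For reference, the Robots Exclusion format is just a plain-text file of directives at the site root. A single overly broad rule, like the one below, is enough to stop all compliant crawlers from indexing anything -- the kind of misconfiguration the paragraph above warns about:

```
# robots.txt -- placed at the root of the site (e.g. /robots.txt)
# "User-agent: *" applies the rules to every compliant crawler.
# "Disallow: /" blocks the ENTIRE site; a common accidental mistake.
User-agent: *
Disallow: /
```

Changing `Disallow: /` to `Disallow: /private/` (or an empty `Disallow:`) would restrict only that path, or nothing at all.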

A global group of publishers is rallying behind a set of extensions to the robots.txt protocol called the Automated Content Access Protocol (ACAP). The consortium, made up mainly of publishers and news organizations like the Associated Press, is fighting for more control over what content the search engines can access.

ACAP currently only offers provisions for text and still images, but publishers can choose to limit how long the engines can cache their content, enforce page-wide nofollow directives, and set other options. The protocol has been functionally tested with the French search engine Exalead.
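ACAP expresses these permissions as additional robots.txt-style fields. The sketch below is illustrative only -- the field names follow the ACAP 1.0 drafts, but the exact syntax and qualifiers (such as the time limit shown) should be checked against the specification:

```
# Hypothetical ACAP-style directives layered onto a robots.txt file.
# Field names per ACAP 1.0 drafts; qualifier syntax is illustrative.
ACAP-crawler: *
ACAP-disallow-crawl: /archive/       # keep crawlers out of older stories
ACAP-allow-crawl: /news/             # permit crawling of current news
ACAP-allow-index: /news/ time-limit=7days   # illustrative cache-expiry qualifier
```

Because the fields are a superset of robots.txt conventions, engines that don't understand ACAP can simply ignore the extra lines.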

Publishers -- and more importantly, the engines -- will have to adopt the protocol on a large scale for it to gain traction. Google, for example, argues that ACAP hasn't been proven effective for publishers outside the news media. And Danny Sullivan notes that while robots.txt is definitely in need of an upgrade, online retailers and blog sites may need provisions that ACAP doesn't cover.

Read the whole story at The Associated Press »
Tags: search