Commentary

How Search Engine Crawlers Locate Sensitive Data

by Laurie Sullivan, Staff Writer, August 23, 2011


Search marketers should work with webmasters and know how to secure data that Google, Bing, Yahoo and other engines should not index. Pages that companies don't want indexed often start off with little more than the text "Not for public release," but marketers should pay more attention to locking that content behind logins.

That's the basic message after talking with Reliable-SEO Founder David Harry about the names and the Social Security numbers of 43,000 faculty, staff, students and alumni from Yale University that became publicly available via Google search for about 10 months.

The breach occurred when a File Transfer Protocol (FTP) server where the university stored the data became searchable via Google as the result of a change to the engine's search algorithm made in September 2010. That's when Google modified its search engine to find and index FTP servers.

Two lines in a plain text file named robots.txt tell the major engines not to crawl a site's content. "You would be amazed at the stuff you can find on Google that people think is secure," Harry said, pointing to "User-agent: *" and "Disallow: /" as two lines that ask all robots to stay away from an entire site, keeping its content from being served up in search queries.
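As a rough illustration of how a well-behaved crawler reads those two lines, the sketch below feeds them to Python's standard-library robots.txt parser. The helper name `is_allowed`, the example URL and the list of crawler names are assumptions made for this example, not something Harry or Google supplied.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above, as they would appear in /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL; any path on the site is covered by "Disallow: /".
    url = "https://example.com/private/report.html"
    for agent in ("Googlebot", "bingbot", "Slurp"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

With these rules in place, the parser reports every path as off limits to every user agent. An empty robots.txt would allow everything.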

Google explains that it won't crawl or index content of pages blocked by robots.txt, but it may still index the URLs found on other pages on the Web. As a result, the URL of the page -- and potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project -- can appear in Google search results.

To use a robots.txt file, marketers or webmasters need access to the root of the domain, because that is where crawlers look for the file. Those without access to the root of a domain can restrict access page by page using the robots meta tag.
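For readers who want to confirm that a page actually carries the robots meta tag, here is a minimal sketch using Python's built-in HTML parser. The sample page, the class name `RobotsMetaParser` and the helper `blocks_indexing` are illustrative assumptions. The only directive checked is `noindex`, the value that asks engines to leave a page out of their index.

```python
from html.parser import HTMLParser

# Hypothetical page whose <head> carries the robots meta tag.
SAMPLE_PAGE = """\
<html>
  <head>
    <title>Internal report</title>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>Not for public release</body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def blocks_indexing(html: str) -> bool:
    """Return True if the page asks compliant engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    print(blocks_indexing(SAMPLE_PAGE))
```

A crawler has to fetch the page to see the tag, so the page must not also be blocked in robots.txt if the tag is meant to take effect.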

Robots, however, do not have to obey the code. "A nefarious type would have his spider ignore the robots, though major search engines do respect them," Harry said. "Ultimately, data should be secured with logins."
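Harry's point about logins can be made concrete with a small sketch. The example below assumes HTTP Basic authentication, which is only one of many possible login schemes and not necessarily what Harry had in mind. The credentials, the function `is_authorized` and the class `ProtectedHandler` are hypothetical names chosen for illustration. A real deployment would serve over HTTPS and keep secrets out of source code.

```python
import base64
import hmac
from http.server import BaseHTTPRequestHandler

# Hypothetical credentials for illustration only.
EXPECTED_USER = "staff"
EXPECTED_PASSWORD = "change-me"


def is_authorized(auth_header):
    """Check an HTTP Authorization header against the expected credentials."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(auth_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparisons avoid leaking how much of a guess matched.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    pass_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and pass_ok


class ProtectedHandler(BaseHTTPRequestHandler):
    """Serves content only to requests that present valid credentials."""

    def do_GET(self):
        if not is_authorized(self.headers.get("Authorization")):
            # Any client without a login gets a 401, whether or not it
            # honors robots.txt.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="restricted"')
            self.end_headers()
            return
        body = b"Restricted report\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, the handler can be passed to `http.server.HTTPServer(("127.0.0.1", 8000), ProtectedHandler)` and started with `serve_forever()`. Unlike robots.txt, this check is enforced by the server, so a spider that ignores robots directives still receives nothing.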

Harry said people need to block search engines via robots.txt on any server or pages that they don't want indexed. He said Google even does what is known as "crawling the deep Web," where its spiders will actually fill in forms. People are too lax in blocking search engines, he said, and this leads to hacking.



About the Author

Laurie Sullivan is a writer and editor for MediaPost. You can reach Laurie at lauriesullivan@gmail.com.