
Search marketers should work with webmasters and know how to keep data that Google, Bing, Yahoo and other engines should not index out of their results. A "Not for public release" label is often the first step for pages that companies don't want indexed, but marketers should pay more attention to locking content they don't want in the engines behind logins.
That's the basic message after talking with Reliable-SEO founder David Harry about the names and Social Security numbers of 43,000 Yale University faculty, staff, students and alumni that were publicly available via Google search for about 10 months.
The breach occurred when a File Transfer Protocol (FTP) server where the university stored the data became searchable via Google, the result of a change Google made to its search engine in September 2010, when it began finding and indexing FTP servers.
Two lines of code in a plain-text file, robots.txt, effectively block the major engines from indexing content. "You would be amazed at the stuff you can find on Google that people think is secure," Harry said, pointing to "User-agent: *" and "Disallow: /" as directives that tell all compliant robots not to index a site's content and serve it up in search queries.
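For illustration, a minimal robots.txt file sits at the root of the domain and, in the blanket case Harry describes, contains just those two lines:

User-agent: *
Disallow: /

The asterisk addresses every compliant crawler, and "Disallow: /" tells them to stay away from the entire site; a narrower rule such as "Disallow: /private/" (a hypothetical directory name) would block only that path.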
Google explains that it won't crawl or index the content of pages blocked by robots.txt, but it may still index the URLs if it finds them on other pages on the Web. As a result, the URL of the page -- and potentially other publicly available information, such as anchor text in links to the site or the title from the Open Directory Project -- can appear in Google search results.
In order to use a robots.txt file, marketers or webmasters need access to the root of the domain. Those without that access can restrict indexing with the robots meta tag instead.
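For illustration, the tag goes in the head of each page that should stay out of the index, typically in a form like this:

<meta name="robots" content="noindex, nofollow">

Here "noindex" asks compliant engines not to list the page in results, and "nofollow" asks them not to follow its links.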
Robots, however, are not obliged to obey those directives. "A nefarious type would have his spider ignore the robots, though major search engines do respect them," Harry said. "Ultimately, data should be secured with logins."
Harry said people need to use robots.txt to block search engines from any server or pages they don't want indexed. Google, he noted, even does what is known as "crawling the deep Web," where its spiders will actually fill in forms. People are too lax about blocking search engines, he said, and that laxity leaves sensitive data exposed to hackers.