The Newest Front in the Online Wars: Splogs
A splog is a spam blog--that is, a fake blog created for the sole purpose of earning a high search engine "page rank," either to reap profits through ad clicks or to drive customers to an otherwise obscure e-commerce site. Like e-mail spam, splogs don't take a rocket scientist to create; they can be built by simple automated scripts or programs that abuse services like Blogspot, Movable Type, WordPress, or Google's Blogger.com.
To keep itself alive, a splog will crawl the Internet using directories, search engines, RSS feeds, etc., collecting information to give the appearance that a real person is adding content. In many cases, this involves automated "theft" of original and often copyrighted content from other authors, without their knowledge, permission, or even attribution.
There are many kinds of splogs, each with its own way of disguising itself as a real blog, but they commonly contain key search terms repeated dozens or even hundreds of times. One researcher tested a "Dance teaching" spam blog and found the word "dance" 948 times on a single page. The total number of words on the page was around 2048. That means nearly half of the page was "dance." Splogs often send any human visitor to an entirely different site, either through clickable links or through the more annoying practice of automated redirects.
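The keyword-repetition pattern described above can be measured with a simple density check. What follows is a minimal sketch, not any search engine's actual method: the function names and the 25 percent threshold are assumptions chosen for illustration.

```python
import re
from collections import Counter


def keyword_density(text, top_n=1):
    """Return (word, count, share_of_page) for the most frequent words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return []
    total = len(words)
    return [(w, c, c / total) for w, c in Counter(words).most_common(top_n)]


def looks_stuffed(text, threshold=0.25):
    """Flag a page whose single most common word exceeds the threshold.

    The 0.25 threshold is an illustrative assumption, not a measured value.
    """
    top = keyword_density(text, top_n=1)
    return bool(top) and top[0][2] >= threshold


# The figures quoted above: 948 occurrences in roughly 2048 words.
print(f"{948 / 2048:.1%}")  # 46.3%
```

A check like this is only one signal. Ordinary posts on a narrow topic can also repeat a word often, so it is better used to rank pages for closer inspection than to reject them outright.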
To give you an idea of the magnitude of the problem, in the United Kingdom there is a company with over 15,000 spam blogs at last count. There were well over 10,000 spam blogs on BlogSpot alone related to the Triple Crown horse races. Of course, each time a visitor clicks on a paid search term, the advertiser pays for it and the "splogger" gets a revenue share.
Sploggers try to defend their actions by saying it should not matter to the advertiser where users find ads, as long as they are clicked on. But most advertisers are very concerned about the environment in which their ads appear, and would be not only surprised by traffic from splogs, but upset by most of it. It is the equivalent of having your ad sold into The New York Times, only to have it show up in some penny sheet in North Dakota.
From the reader's perspective, it is a serious issue as well: most would prefer to read a story from the original source, not replicated out of context and certainly not with non sequitur search terms inserted randomly into its text. Even more troubling is the prospect of the "infoblog" overwhelming the "individual" content to the point where the Blogosphere more closely resembles late-night TV than a forum for thoughtful discussion and exchange of information.
Blog search companies must maintain an aggressive stance on blog spam, and continue to hone their tools and techniques. The experience of developing and deploying anti-spam tools for e-mail makes it clear that combating blog spam needs to be part of the search company core, not an afterthought or add-on.
Building on years of experience with cataloging blog posts, Feedster has implemented and is continuing to refine an integrated, multi-layer approach to responding quickly and accurately to spam. A handful of the "tricks" of e-mail spam identification are applicable to the world of blogs, but the problem is compounded: sploggers don't have to use "e-mail blast" software that leaves its unique "signature" on tens of millions of e-mails, but can use the "real" software to make their blogs.
Beyond the obvious step of looking for "Viagra" and all its misspellings, all search engines should employ a sophisticated approach to detecting blogs that are reasonably certain not to contain original content or commentary, beyond that intended solely for consumption by search engine crawlers. Blacklists of known splog domains are a good start--for eliminating their content, as well as the content of sites that refer to them to raise their mutual page rank. The way a blog is published can also tell you a lot about its source. For example, it wouldn't be surprising if every blog published using WordPress on the .info domain with Google AdSense turned out to be a splog. There is also the "human factor," both in terms of "I know spam when I see it" and in quickly responding to the inevitable misclassification of a blog as spam.
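The layered checks described above (a domain blacklist, links to blacklisted domains, and the way a blog is published) might be combined along the following lines. This is a hedged sketch under stated assumptions, not Feedster's implementation: the `Blog` record, the `splog_signals` function, and the sample blacklist entry are all hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Hypothetical blacklist; a real one would be maintained and far larger.
BLACKLIST = {"spam-example.info"}


@dataclass
class Blog:
    """Hypothetical record of what a crawler knows about a blog."""
    url: str
    platform: str = ""
    has_ads: bool = False
    outbound_links: list = field(default_factory=list)


def host(url):
    """Lower-cased hostname of a URL, or an empty string if it has none."""
    return (urlparse(url).hostname or "").lower()


def splog_signals(blog, blacklist=BLACKLIST):
    """Return the reasons a blog looks suspicious (empty list if none)."""
    signals = []
    h = host(blog.url)
    if h in blacklist:
        signals.append("blacklisted domain")
    if any(host(link) in blacklist for link in blog.outbound_links):
        signals.append("links to blacklisted domain")
    # Publishing profile from the example above: WordPress + .info + ads.
    if h.endswith(".info") and blog.platform.lower() == "wordpress" and blog.has_ads:
        signals.append("publishing profile")
    return signals
```

Returning the list of reasons rather than a bare yes or no leaves room for the "human factor": a reviewer can see why a blog was flagged and quickly reverse a misclassification.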
Can the war on splogs be won? No--there will always be those who try to "game" influential search engines. Just as with e-mail spam, there is a balance to strike between a pristinely spam-free result that discards some valuable content and an approach that eliminates the bulk of the spam but makes sure that you don't miss anything important. In this rapidly changing world, Feedster is fighting to make sure that you can find the tiny gems of meaningful, timely content without having them swept away with the splogs.
J. Scott Johnson is co-founder and chief technology officer of Feedster, Inc.