If it wasn’t apparent from my last project spotlight, I am often inclined to apply my programming knowledge to fun, low-reward money-making schemes. After all, when it comes to website building and Search Engine Optimization (SEO), content is king, and a look at the leading affiliate sites makes it evident that they all have strong content. The problem then shifts to acquiring content: you can create it yourself, hire someone to create it for you, or take existing content and pass it off in your own unique way.

The first two options are pretty much impossible to do programmatically, so the third was the logical route for me. However, Google has a strict no-spam policy, and scraped data often constitutes spam. Even though what I was doing would probably be considered breaking the rules, I had absolutely nothing to lose, so I began planning how I would generate content for my affiliate site.

After some brainstorming, I had a plan: I would write a web scraper in Java that would scan through the electronics section of Amazon, find items over a specific price threshold, and then scrape all of the reviews for each item. Then, using each review’s score and helpfulness percentage (the thumbs up/down system that rates each review), the program would compute a weighted score that was used to sort the reviews from most to least helpful.
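The post doesn’t give the exact weighting formula, but a minimal sketch of this kind of ranking, assuming the weight is simply the star rating scaled by the review’s helpfulness ratio (my assumption, not the original formula), might look like:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical review record: star rating (1-5) plus thumbs up/down counts.
record Review(String text, int stars, int thumbsUp, int thumbsDown) {
    // Fraction of voters who found the review helpful (0.5 when unvoted).
    double helpfulness() {
        int votes = thumbsUp + thumbsDown;
        return votes == 0 ? 0.5 : (double) thumbsUp / votes;
    }

    // Assumed weighting: star rating scaled by the helpfulness ratio.
    double weightedScore() {
        return stars * helpfulness();
    }
}

public class ReviewRanker {
    // Sorts reviews from most to least helpful by the weighted score.
    static List<Review> rank(List<Review> reviews) {
        List<Review> sorted = new ArrayList<>(reviews);
        sorted.sort(Comparator.comparingDouble(Review::weightedScore).reversed());
        return sorted;
    }
}
```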

Next, I needed a site to host the content. I didn’t want to invest much into this project, as it was likely something that Google would throw a fit about, so I wrote a script that took the content and posted it to a Google Blogger blog. Each post had a bit of information about the product, photos of the product, and ten of the highest ranked reviews as per my weighting algorithm. After all was said and done, this is how the site looked:


Note that the links to the item contained an affiliate link to the product on Amazon. With this functionality, I posted a handful of reviews and waited to see what would happen.
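Amazon Associates links work by appending a tracking tag to the product URL as a query parameter; a minimal sketch of tagging a link (the tag value here is a placeholder, not a real affiliate ID):

```java
// Appends an Amazon Associates tracking tag to a product URL.
public class AffiliateLink {
    static String tag(String productUrl, String affiliateTag) {
        // Use '&' if the URL already carries query parameters.
        String separator = productUrl.contains("?") ? "&" : "?";
        return productUrl + separator + "tag=" + affiliateTag;
    }
}
```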

Based on previous experience with SEO, I have found that Google highly favors content hosted on its own platforms, and since Blogger is owned and maintained by Google, a steady stream of content would likely mean that my automated site would rank for keywords. All I needed to do was ensure that content was posted regularly, and to do that, some modifications would need to be made to my product poster program.

Blogger limits users to 50 posts per day, and it would look incredibly suspicious to have every product posted within minutes of one another. Luckily, Blogger supports post scheduling, and I could use that to my advantage. After a few adjustments, the poster would load all of the scraped products, schedule 50 of them one hour apart from each other, and then wait 24 hours plus 5 minutes to avoid tripping a “post limit exceeded” error. Scheduling all 50 posts in rapid succession like this is still suspicious-looking activity, but I decided to roll with it for two reasons:

  1. As long as the Blogger service didn’t catch on (and it never did), the Google search algorithm would likely not take this into account, as it only sees the time posts actually go public (which is every hour).
  2. Frequent content is a huge factor in a site’s ranking (though for it to be useful, it needs to be “good” content).
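The scheduling logic described above can be sketched as follows; the batch size and the 24-hour-plus-5-minute buffer come from the post, while everything else (names, structure) is my own illustration:

```java
import java.time.Duration;
import java.time.Instant;

public class PostScheduler {
    // Blogger's daily post limit, as described in the post.
    static final int DAILY_LIMIT = 50;

    // Computes publish times for one day's batch: 50 posts,
    // one hour apart, starting from the given instant.
    static Instant[] batchTimes(Instant start) {
        Instant[] times = new Instant[DAILY_LIMIT];
        for (int i = 0; i < DAILY_LIMIT; i++) {
            times[i] = start.plus(Duration.ofHours(i));
        }
        return times;
    }

    // Delay before scheduling the next batch: 24 hours plus a
    // 5-minute buffer so the daily limit has definitely reset.
    static Duration nextBatchDelay() {
        return Duration.ofHours(24).plus(Duration.ofMinutes(5));
    }
}
```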

Thanks to this new posting method, my operation was completely autonomous. All I had to do at this point was leave the program running in the tray and go about my day.

After about a week, Google indexed my site, and I began to see an influx of visitors for a large variety of keywords. It appeared that my plan was working after all.

I went from no visitors to 300 per day in 24 hours

By the end of May, I had accumulated thousands of link clicks and a handful of sales through the Amazon Affiliate program, and Google had still not taken any action against my automated site.

In late July, things did indeed take a turn for the worse.

I woke up one day, checked my site’s analytics, and saw that my organic traffic had practically flatlined. I knew that I had been penalized by Google, which was my largest source of organic traffic. Interestingly, though, the method of content generation I was using was not penalized by the algorithm itself. Because the reviews were rearranged and parts of them omitted, GoogleBot noticed that my site was suspicious, but instead of taking automated action against my account, it flagged the site for human inspection, and my site had been slapped with a manual action. For a search engine as large as Google, you can bet your ass a large portion of spam is automatically penalized by GoogleBot. This was not.



Regardless, I can’t say I didn’t see this coming. The great part about Google is that it’s designed to only index content that provides real value to the user; it’s the reason search results stay relevant and high quality. And let’s be honest, the content I was posting to the site wasn’t very useful, if it was useful at all.

Going forward, I would like to expand on this idea. I started working on a similar project for review analytics in mid-July but didn’t get very far with it. I plan on restarting in C# in a few weeks, so that will be a story for another Project Spotlight.