If you haven’t read my spotlight article on the Amazon affiliate experiment that I performed, I’d suggest reading that first to familiarize yourself with what I’m trying to do here. This article outlines my experience with porting the ReviewEngine project to C#, as well as re-designing key aspects of the project to provide site visitors with data that is actually useful. Also, it should be noted that this article is being written as development progresses, so that the development history is recorded accurately. With that being said, let’s get started:
Previously, ReviewEngine had been programmed in Java using the Jaunt web-scraping API. While there is nothing wrong with this, I have taken a liking to C# over Java, and I prefer to use C# to continue development. Since the Jaunt library is Java-only, I am switching to HtmlAgilityPack, which is open source and works with .NET.
Update #1 – 11/13/15
Scraping Amazon is easy. The site is chock-full of useful identifiers that make it extremely easy to locate the data you need. While site redesigns can pose a problem, I can update the scraper methods when needed to keep the operation running smoothly. Additionally, Amazon offers a Product API which provides all sorts of useful metadata. There’s just one problem: it doesn’t allow access to reviews.
Because of this, downloading the raw HTML of the scraped pages is the best option for now. It isn’t all bad, though: navigating to http://amazon.com/product-reviews/%ASIN%/ lists ten reviews at a time, and with a bit of trickery, pagination isn’t a problem either. All you need to do is check for an enabled “Next Page” button.
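The “Next Page” check can be sketched roughly like this. Note the class names (`a-last`, `a-disabled`) are my assumptions about Amazon’s markup at the time, and the raw string inspection here is a stand-in for the HtmlAgilityPack selectors the real scraper would use:

```csharp
using System;

static class Pagination
{
    // Returns true if the page has an enabled "Next Page" control.
    // Class names are assumptions; a real implementation would use
    // HtmlAgilityPack's XPath selectors instead of string searches.
    public static bool HasNextPage(string html)
    {
        int i = html.IndexOf("a-last", StringComparison.Ordinal);
        if (i < 0) return false; // no "Next Page" control at all

        // Grab the tag surrounding the marker and look for a disabled class.
        int start = html.LastIndexOf('<', i);
        int end = html.IndexOf('>', i);
        if (start < 0 || end < 0) return false;
        string tag = html.Substring(start, end - start + 1);
        return !tag.Contains("a-disabled");
    }
}
```

When the button is present but carries the disabled class, the scraper knows it has reached the last page of reviews for that ASIN.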
The scraper itself is fairly straightforward. It is contained in the AmazonProduct class, which, when instantiated, self-populates all the data it needs. This allows me to scrape ASINs (Amazon Product IDs) from category listings and send them straight to new objects.
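The rough shape of that class looks something like the sketch below. The property names and the internals of `Scrape()` are my assumptions, since the full class isn’t shown here:

```csharp
using System.Collections.Generic;

// Rough sketch of the self-populating product class described above.
// Property names and Scrape() internals are assumptions; the real class
// would use HtmlAgilityPack to fill Reviews from /product-reviews/{Asin}/.
class AmazonProduct
{
    public string Asin { get; }
    public List<string> Reviews { get; } = new List<string>();

    public AmazonProduct(string asin)
    {
        Asin = asin;
        Scrape(); // the constructor kicks off the scrape, so callers just pass an ASIN
    }

    private void Scrape()
    {
        // Placeholder: download the review pages for this ASIN,
        // parse ten reviews per page, and follow the "Next Page" link.
    }
}
```

The design choice here is that construction and population are one step: anything holding an AmazonProduct reference can assume its data is already there.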
A guest speaker from a local company called Urban Science introduced me to the MapReduce algorithm concept. While this is commonly used for cluster computing, it can still be applied to what I’m trying to do here. I wrote a threading engine that scrapes reviews and then splits each one into a Dictionary<string,int>, where the key is the word and the value is the number of times that word appears. Since each scrape-and-count job was taken care of on its own thread, I could analyze data in bulk.
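The “map” step described above can be sketched as a simple word-counting function. The tokenization rules (lowercasing, splitting on whitespace and punctuation) are my assumptions:

```csharp
using System;
using System.Collections.Generic;

static class ReviewMapper
{
    // The "map" step: split one review into word -> occurrence-count entries.
    public static Dictionary<string, int> CountWords(string review)
    {
        var counts = new Dictionary<string, int>();
        var words = review.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',', '!', '?', ';', ':' },
                   StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            counts.TryGetValue(word, out int n);
            counts[word] = n + 1;
        }
        return counts;
    }
}
```

For example, `CountWords("Slow, slow drive. Very slow!")` yields `{"slow": 3, "drive": 1, "very": 1}` — exactly the per-review entries that get merged into the repository later.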
Of course, this data is fairly useless for Semantic Analysis without data correlating the usage of words to the overall opinion given in the review. To accomplish this (for now at least), I created a class called WordRepository. This is used to store all occurrences of words, and will ultimately train itself as reviews are scraped. Here’s how it works:
After the review is split into Dictionary entries, each word is added to the WordRepository along with its occurrence count and the review’s score. Articles and other junk words are thrown out. For example, say a review for a hard drive contains the word “slow” seven times, and the review itself gave the product two stars. The WordRepository would add those occurrences to its entry for “slow,” each time recalculating the overall cumulative review score for that word. It isn’t perfect, but it will have to do for now.
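A minimal sketch of that bookkeeping, assuming the cumulative score is an occurrence-weighted average of the star ratings a word appears under (the article doesn’t spell out the exact formula):

```csharp
using System.Collections.Generic;

// Minimal sketch of the WordRepository: each word keeps a running,
// occurrence-weighted average of the star ratings it appears under.
class WordRepository
{
    private readonly Dictionary<string, (long Count, double ScoreSum)> entries
        = new Dictionary<string, (long, double)>();

    public void Add(string word, int occurrences, int reviewStars)
    {
        entries.TryGetValue(word, out var e);
        entries[word] = (e.Count + occurrences,
                         e.ScoreSum + (double)occurrences * reviewStars);
    }

    // Cumulative score for a word: weighted average of review stars.
    public double Score(string word)
    {
        var e = entries[word];
        return e.ScoreSum / e.Count;
    }
}
```

With the hard-drive example above — “slow” seven times in a two-star review, then, say, three more times in a one-star review — the cumulative score for “slow” would be (7×2 + 3×1) / 10 = 1.7.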
Before this is even remotely useful, however, we’re going to have to train it. I plan to scrape one thousand ASINs in the electronics category to train the WordRepository on. I’ll talk about that in the next update.
Update #2 – 11/16/15
Semantic Learning Routine
After posting the previous update, I immediately started work on a learning routine for the WordRepository. Essentially, you pass it a single ASIN, and using the related items section located on a majority of product listings, it finds new targets. Once enough targets have been collected, it scrapes them, analyzes them, and factors them into the WordRepository. When the targets run out, it begins collecting again.
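The target-collection loop is essentially a breadth-first crawl seeded by one ASIN. A sketch, where `relatedAsins` is a hypothetical stand-in for the scraper call that reads the related items section:

```csharp
using System;
using System.Collections.Generic;

static class LearningCrawler
{
    // Sketch of the learning routine's crawl: start from one ASIN, pull new
    // ASINs from each product's "related items" section, and keep going until
    // the frontier runs dry or the limit is hit. relatedAsins is a stand-in
    // for the actual scraper call.
    public static List<string> Crawl(string seedAsin,
        Func<string, IEnumerable<string>> relatedAsins, int limit)
    {
        var visited = new HashSet<string>();
        var frontier = new Queue<string>();
        frontier.Enqueue(seedAsin);

        var order = new List<string>();
        while (frontier.Count > 0 && order.Count < limit)
        {
            var asin = frontier.Dequeue();
            if (!visited.Add(asin)) continue; // already scraped this product
            order.Add(asin); // scraping + WordRepository analysis happens here
            foreach (var next in relatedAsins(asin))
                frontier.Enqueue(next);
        }
        return order;
    }
}
```

The visited set keeps the routine from re-scraping products that show up in multiple related-items sections, which happens constantly within a single category.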
As of right now, the learning routine has been running for a bit over 72 hours and is still going strong. The WordRepository has just over 90,000 unique word entries taken from around 1,600 products (1,020,218 reviews, to be exact). This is a good start, and many of the entry scores that I have searched for by hand seem to correlate decently with the meaning of the words.
Now that there is enough data to begin analyzing reviews with, I can begin calculating my own estimate score based on the vocabulary used in each review. The algorithms are ready to be tested, and I plan on collecting Percentage Error values to determine how much the newly calculated score differs from the actual product score on Amazon. That is a topic for the next update, however.
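Percentage error here is just the standard formula, comparing my vocabulary-based estimate against the actual Amazon score:

```csharp
using System;

static class Evaluation
{
    // Percentage error between the score estimated from review vocabulary
    // and the actual product score on Amazon.
    public static double PercentError(double estimated, double actual)
        => Math.Abs(estimated - actual) / actual * 100.0;
}
```

For instance, estimating 3.4 stars for a product that actually averages 4.0 stars gives an error of |3.4 − 4.0| / 4.0 × 100 = 15%.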
Update #3 – 1/2/16
Upon implementing the algorithm to recalculate the review scores, I noticed a problem. Looking at the screenshot above, you may have noticed it too: because a large majority of the reviews on Amazon are five stars, the average scoring has been skewed. This means that a large majority of scores remained positive even when the review conveyed distaste. To combat this, I made some changes to how the program gathers and weighs reviews.
First, the scraper was modified to download an even number of reviews for each star rating, thus making the sample size for all possible scores equal. Then, word frequency was factored in as a weighting indicator: commonly used words carry very little weight, while words that are less common but established in the repository carry a higher overall weight.
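The article doesn’t give the exact weighting formula, but one plausible reading is an IDF-style inverse-frequency weight, sketched here as an assumption rather than the actual implementation:

```csharp
using System;

static class WordWeighting
{
    // One plausible reading of the weighting described above (an assumption,
    // not the article's actual formula): an IDF-style weight where common
    // words get little weight and rarer, established words get more.
    public static double Weight(long wordCount, long totalWords)
    {
        if (wordCount <= 0) return 0.0; // not established in the repository
        return Math.Log((double)totalWords / wordCount);
    }
}
```

With a million total word occurrences, a word seen 100,000 times would weigh ln(10) ≈ 2.3, while one seen 1,000 times would weigh ln(1000) ≈ 6.9 — so filler vocabulary contributes far less to the recalculated score than distinctive, opinion-bearing words.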