Detecting Deceptive Opinion Spam

Ever seen a review like this?

My husband and I satayed for two nights at the Hilton Chicago,and enjoyed every minute of it! The bedrooms are immaculate,and the linnens are very soft. We also appreciated the free wifi,as we could stay in touch with friends while staying in Chicago. The bathroom was quite spacious,and I loved the smell of the shampoo they provided-not like most hotel shampoos. Their service was amazing,and we absolutely loved the beautiful indoor pool. I would recommend staying here to anyone.

Your author has come across dozens that started like this one: with “My husband and I” or “My spouse and I”. Surfing over to Yelp and choosing San Francisco brings up another, “My husband stayed here for a little less than a week and were extremely pleased with the place…”

Turns out there’s a good reason for this similarity: many of these reviews are fake, put there by mercenaries, making as little as $5 for two, necessarily glowing, reviews. The $5 figure is from the New York Times, via A&LD. Bogus “five-star” ratings on sites like Amazon and TripAdvisor turn out to be a large problem.

The glowing notice above is known to be fake because it was solicited via a website that specializes in selling fake reviews (I have no idea whether the Yelp review is fake or genuine). This solicitation was done as part of a study by Myle Ott and others at Cornell in an effort to develop an algorithm that can detect fakes.

Incidentally, Ott is a computer scientist, and those guys say “train algorithm” when statisticians say “fit model” or physicists say “build model.” All these terms mean exactly the same thing—though, admittedly “training an algorithm” sounds sexier than “fitting a model.” “Training” implies that “learning” can go on indefinitely, while “fitting” implies merely applying some formula. Computer scientists are winning the battle of terminology. They are also—justifiably—winning the battle over the philosophy of modeling, but that’s a story for another day.

Building the algorithm to detect fraudulent reviews is not simple; however, creating the database from which to fit the model is the real trick. One earlier approach was to gather reviews that are suspiciously similar to one another, i.e., likely plagiarized. Another was to “ask participants to give both their true and untrue views on personal issues (e.g., their stance on the death penalty).” Everybody becomes their own control in this way.

Here, the authors did one better and solicited 400 fake reviews in the same way that fake reviews are solicited by actual websites. They also gathered 400 hoped-to-be-genuine reviews from TripAdvisor. In the end, they had 20 real and 20 fake reviews for each of 20 different hotels. These were used to fit their model, or train their algorithm, if you will.

One tidbit was the discovery that fake reviews are often written in a hurry. One “took just 5 seconds and contained 114 words.” This of course implies the text was prepared in advance and cut and pasted in. Reviews written by first-time users, or under newly created user names, are also more likely to be fake. Sites like TripAdvisor can use these facts as pieces of information to flag a review as genuine or fake.
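To put a number on that hurry, here is a minimal sketch of the arithmetic. The function names and the ceiling of 3 words per second are my own assumptions for illustration, not anything taken from Ott's study or from TripAdvisor.

```python
def words_per_second(n_words, seconds):
    """Composition speed implied by a review's length and the time spent on it."""
    return n_words / seconds


def looks_pasted(n_words, seconds, max_wps=3.0):
    """Flag a review 'typed' faster than a hypothetical ceiling of max_wps words/second.

    The default ceiling is an assumption chosen for illustration only.
    """
    return words_per_second(n_words, seconds) > max_wps


# The review quoted above: 114 words in 5 seconds.
print(words_per_second(114, 5))
print(looks_pasted(114, 5))

# The same 114 words spread over four minutes would not be flagged.
print(looks_pasted(114, 240))
```

The quoted review works out to about 22.8 words per second, far beyond anybody's typing. Since honest reviewers also draft elsewhere and paste, a flag like this is one piece of information to weigh, not a verdict.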

The models themselves were naive Bayes and support vector machines, both commonly used as classifiers. Classification is the meat and potatoes of statistics (I would say it is the sole reason for its existence; of that, more another time). Logistic regression is classification, as are discriminant analysis, so-called machine learning algorithms, and on and on.
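For the curious, here is a rough sketch of what a naive Bayes text classifier does, in plain Python. It is a generic multinomial naive Bayes with add-one smoothing, not the authors' implementation, and the four training reviews are toy examples I made up, not items from Ott's data.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class; collect the vocabulary."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log prior plus summed log word likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no information here
            # Add-one (Laplace) smoothing keeps unseen class/word pairs above zero.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up training reviews (NOT from Ott's data set).
TOY_DOCS = [
    "my husband and i loved the experience",
    "i feel this was a wonderful vacation experience",
    "small room but the location near the station was good",
    "the bathroom floor was dirty and checkout was slow",
]
TOY_LABELS = ["fake", "fake", "genuine", "genuine"]

model = train_naive_bayes(TOY_DOCS, TOY_LABELS)
print(predict(model, "my husband and i feel it was a great experience"))
print(predict(model, "the room was small and the checkout was slow"))
```

The "naive" part is the assumption that words occur independently of one another given the class, which is false but often works well enough for sorting text into bins.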

Support vector machines are a kind of non-parametric discriminant analysis. Various combinations of functions of data are produced which spit out whether the given message is likely fake or likely real. If you want to be fancy, you say SVMs “find a high-dimensional separating hyperplane between two groups of data.”
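Stripped of the fancy language, a linear separating hyperplane boils down to a weighted sum and a threshold. The sketch below shows only that decision rule; the weights, the bias, and the three features are invented by hand for illustration, whereas a real SVM would learn them from the training reviews.

```python
def decision_value(weights, bias, features):
    """Signed score w . x + b; its sign says which side of the hyperplane x falls on."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Positive side of the hyperplane is called 'fake', the other side 'genuine'."""
    return "fake" if decision_value(weights, bias, features) > 0 else "genuine"


# Hypothetical, hand-picked values (NOT learned from data).
# Features, in order: count of "i", count of "husband", count of "room".
WEIGHTS = [0.8, 0.5, -0.3]
BIAS = -1.0

print(classify(WEIGHTS, BIAS, [3, 1, 0]))  # lots of "i" and a "husband"
print(classify(WEIGHTS, BIAS, [0, 0, 2]))  # talks about the room instead
```

With three features the "hyperplane" is an ordinary plane; with thousands of word counts it is the same sum, just longer.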

The data is the messages themselves and facts about them: how long each took to write, the number of times the word “I” was used, and so on. For example, deceptive reviews used “experience”, “my husband”, “I”, “feel”, “business”, and “vacation” more than genuine ones.
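Turning a review into numbers of this kind might look like the sketch below. The cue list is the one quoted above; the function name, the crude tokenizer, and the extra word-count feature are assumptions of mine, not the authors' feature set.

```python
import re
from collections import Counter

# Cue terms quoted above; "my husband" is a two-word phrase (a bigram).
CUE_TERMS = ["experience", "my husband", "i", "feel", "business", "vacation"]


def cue_counts(review):
    """Count each cue term in a review, plus the review's length in words."""
    # Crude tokenizer (an assumption): lowercase runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", review.lower())
    word_counts = Counter(tokens)
    bigram_counts = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    features = {}
    for term in CUE_TERMS:
        counts = bigram_counts if " " in term else word_counts
        features[term] = counts[term]
    features["n_words"] = len(tokens)
    return features


print(cue_counts("My husband and I feel it was a five star experience."))
```

A dictionary like this, one per review, is the sort of thing that gets handed to the classifier.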

They got about 90% accuracy on their test data, which is excellent, especially considering that human readers do no better than 50%. Experience says that such a high rate won’t be realized on new data. Why?

Well, the model was fit to the data at hand. If new data were exactly like the data at hand, then the new accuracy rate would be the same as the old. But the new data is never exactly like the old data: if it were, it would be a mere copy. It is the inevitable differences between the old and new that account for the decrease in performance.
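A deliberately crude toy makes the point. The "model" below just remembers which words appeared in fake training reviews and never in genuine ones; it is not Ott's model, and every review in it is made up by me. It scores perfectly on the data it was fit to and stumbles on reviews it has not seen.

```python
def fit_cue_words(docs, labels):
    """'Fit' a crude model: words seen in fake training reviews but never in genuine ones."""
    fake_words, genuine_words = set(), set()
    for text, label in zip(docs, labels):
        target = fake_words if label == "fake" else genuine_words
        target.update(text.lower().split())
    return fake_words - genuine_words


def predict(cue_words, text):
    """Call a review fake if it contains any remembered cue word."""
    return "fake" if cue_words & set(text.lower().split()) else "genuine"


def accuracy(cue_words, docs, labels):
    """Fraction of reviews whose predicted label matches the given label."""
    hits = sum(predict(cue_words, doc) == label for doc, label in zip(docs, labels))
    return hits / len(docs)


# Toy, made-up reviews (NOT from Ott's data set).
OLD_DOCS = ["my husband and i loved it", "a wonderful vacation experience",
            "the room was small", "checkout was slow"]
OLD_LABELS = ["fake", "fake", "genuine", "genuine"]

NEW_DOCS = ["my wife and i enjoyed the pool", "amazing staff amazing pool",
            "the breakfast was cold", "parking was expensive",
            "i thought the room was noisy"]
NEW_LABELS = ["fake", "fake", "genuine", "genuine", "genuine"]

cues = fit_cue_words(OLD_DOCS, OLD_LABELS)
print(accuracy(cues, OLD_DOCS, OLD_LABELS))  # data at hand
print(accuracy(cues, NEW_DOCS, NEW_LABELS))  # new data
```

The new fake review that avoids the remembered words slips through, and the honest reviewer who happens to write "I" gets flagged. Real models are less brittle than this one, but the same mechanism is at work.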

This wisdom applies not just to Ott’s model, but to all statistical/probabilistic or computer science/fuzzy logic models. The models’ performance is always conditional on the data at hand.


Ott has made his data publicly available. Do not download, however, unless you know how to read things like this, “!/.__The/DT ,/,__and/CC ,/,__and/CC ,/,__and/CC ,/,__and/CC ,/,__as/IN ./.__I/PRP ./.__The/DT ./.__Their/PRP$ ./…”
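If you do download it, a few lines of code will unpick that notation. The sketch below assumes, from the look of the sample only, that each whitespace-separated item is a pair of token/TAG entries joined by a double underscore, with tags that resemble Penn Treebank part-of-speech tags. I have not checked this against the data set's documentation, and the function name is my own.

```python
def parse_tagged_bigram(item):
    """Split 'tok/TAG__tok/TAG' into [(tok, TAG), (tok, TAG)].

    Assumes '__' joins the two entries and the LAST '/' separates token from tag.
    """
    pairs = []
    for part in item.split("__"):
        token, _, tag = part.rpartition("/")
        pairs.append((token, tag))
    return pairs


# A few complete items copied from the sample above (the truncated tail is left out).
SAMPLE = "!/.__The/DT ,/,__and/CC ./.__I/PRP ./.__Their/PRP$"

parsed = [parse_tagged_bigram(item) for item in SAMPLE.split()]
for pair in parsed:
    print(pair)
```

Splitting on the last slash rather than the first is a small precaution in case a token itself contains a slash.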


  1. …“training an algorithm” sounds sexier than “fitting a model.”

    Spoken like someone who’s never watched Project Runway. They often have to generously pixelate the picture when the models come in for their fittings.

  2. My spouse and I….. oh nuts.

    We toured UK by car this spring and France last fall. we relied on travel-site reviews. I doubt if there were any phony reviews of any of the places we stayed (no satay at any of them) mostly B&Bs. Sometimes the downside conveyed by the review could be subtle but still communicate – “Nice place, but don’t get trapped by owner/host who talks too much.”

    Besides, if it is a saccharine review, you don’t get any useful information.

    We did stay at one place that seemed to be badly reviewed. We could tell that we would not have liked the reviewers. It was a wonderful place, if you could deal with a small room.

  3. On Amazon, I always check to see the other reviews somebody has provided. If they are single reviews, I don’t even bother commenting, no matter how delusional the review is. Good reviewers on Amazon and TripAdvisor are very helpful and IMHO add real value to a site.

  4. I don’t see how the time used in composing a review can be considered valid.

    Depending on comment length, I might type it in the box provided or I might type into my text editor and copy/paste it into the box.

  5. That’s just great. Now we have an automated BS detector. What’s left for guys like me to do? Maybe I can get one of those green jobs.

  6. What is the difference between “satayed” and “sauteed”? If none, I presume it’s better in the Chicago Hilton than in a cast iron skillet albeit more costly. If I had my druthers I’d druther a “fit model” over a “trained algorithm” any day– especially with free wifi. Besides, everybody knows that models are high-strung and difficult to train.

  7. My husband read this blog post during his vacation and I feel pleased to say he was extremely satisfied with the experience. I would recommend this blog post to anyone whether for business or vacation. The staff at this blog post were courteous and professional. My husband and I feel it was a five star experience.

  8. Wow! Finally something I can add to! Mr. Briggs, you nailed the description of modeling perfectly. I’d like to add the following though, as I see these as big differences between what the CS community does and what I see in other fields.

    1. Very often, unless researching a specific ‘learning algorithm’, the model builder is not as interested in knowing why or how something is working– they just want results. There is much less of a burden within the community to explain the inner workings and significance of the final model, and more of an emphasis on achieving low error rates.

    2. One thing that the ML (machine learning) community does that I haven’t seen so much in other fields is finding novel ways to encode the data. It’s uncommon to use raw data as is, and a lot of work is devoted to producing ‘features’. In the case of the BS detector they are using bigrams and trigrams, individual word frequencies, word categories, likely hapax legomena counts, etc…
