FxPaul

Math in finance or vice versa

Hidden Markov Model for application store ratings

A hidden Markov model (HMM) is a statistical model in which the system is assumed to be a Markov process with hidden states. Those states cannot be observed directly, but they can be inferred from the outputs, i.e. the observed sequences. In other words, the outputs make it possible to infer some probabilistic properties of the system.

As a side note, application stores usually rank apps by user comments and ratings. The simplest way to derive an app rating is to calculate the average or the median, i.e. some statistical property of the rating samples. The average is not a robust statistic: its value is affected by outliers, for instance by deviant ratings submitted by users. Thus a robust procedure might be used to improve the rating.
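To make the robustness point concrete, here is a small illustration using only the Python standard library. The rating list is invented for the example and is not taken from the market data.

```python
from statistics import mean, median

# Hypothetical ratings: eight 5-star votes plus two deviant 1-star votes.
ratings = [5, 5, 5, 5, 5, 5, 5, 5, 1, 1]

avg = mean(ratings)    # pulled down by the two outliers
med = median(ratings)  # stays at the value most users gave

print(avg, med)
```

With these numbers the average drops to 4.2 while the median remains 5.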

In fact we can apply HMM mechanics to infer the real application rating as the most likely explanation of the observed user ratings. Let’s see how to do that.
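The "most likely explanation" of an observed sequence is what the Viterbi algorithm computes. The sketch below is a plain-Python version, not the code used for this article. The sticky transition matrix (0.9 on the diagonal) and the emission decay factor of 0.5 are assumptions chosen only to make the example run.

```python
import math

def viterbi(obs, start, trans, emit):
    """Most likely hidden-state path for a list of observation indices."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(start)
    score = [log(start[s]) + log(emit[s][obs[0]]) for s in range(n)]
    back = []  # back[t][j]: best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev, score, pointers = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log(trans[i][j]))
            pointers.append(best)
            score.append(prev[best] + log(trans[best][j]) + log(emit[j][o]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

N = 5  # ratings 1..5 are stored as indices 0..4
start = [1.0 / N] * N
# Assumed transitions: a state mostly keeps its value.
trans = [[0.9 if i == j else 0.025 for j in range(N)] for i in range(N)]
# Assumed emissions: probability halves with each step away from the state.
emit = []
for s in range(N):
    w = [0.5 ** abs(s - r) for r in range(N)]
    emit.append([x / sum(w) for x in w])

ratings = [5, 5, 1, 5, 5]  # hypothetical user ratings with one deviant vote
path = viterbi([r - 1 for r in ratings], start, trans, emit)
decoded = [s + 1 for s in path]
print(decoded)
```

Under these assumed parameters the decoded hidden rating stays at 5 for the whole sequence, so the single 1-star vote is treated as noise rather than as a real change in quality.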

Model description

The AppStore initial (not-learned) HMM has:

1. Observables – list of integers from 1 to 5 where 5 is the best
2. Hidden states – integers from 1 to 5 with P = 0.2 for each state
3. Probabilities of each observable for a given hidden state. The heuristic is that observables should be close to the hidden state, i.e. if the hidden state is 3 then 3 is the most likely output, 2 and 4 are less probable, and 1 and 5 are the least probable
4. Transitions between hidden states. Heuristic rule is to give the maximum weight to

The initial model is shown in the picture. Note that states are numbered from 0 to 4.
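Items 2 and 3 of the list above can be sketched in a few lines of Python. The exact emission weights are not given in the text, so the decay factor of 0.5 is an assumption. The transition matrix is left out of the sketch because no concrete weights for it are stated here.

```python
N = 5  # ratings 1..5 are stored as indices 0..4, as in the picture

# Item 2: each hidden state starts with probability 0.2.
start = [1.0 / N] * N

# Item 3: the emission weight shrinks with the distance between the hidden
# state and the observed rating. DECAY is an assumed value.
DECAY = 0.5

def emission_row(state):
    weights = [DECAY ** abs(state - rating) for rating in range(N)]
    total = sum(weights)
    return [w / total for w in weights]

emit = [emission_row(s) for s in range(N)]

# Row for hidden state 3 (index 2): peak at 3, then 2 and 4, then 1 and 5.
print([round(p, 3) for p in emit[2]])
```

With this decay the row for hidden state 3 is 0.1, 0.2, 0.4, 0.2, 0.1, which matches the ordering the heuristic asks for.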

Training dataset retrieval

I used an unofficial Android Market API to retrieve rating comments for the top ten free applications and trained the Markov model on them. At the time of writing, those apps were:

• AppBrain App Market
• Lookout Security & Antivirus
• File Expert
• Android Assistant(17 features)
• Greatest Magic Performances!
• Sudoku Free
• US Yellow Pages
• Brightest Flashlight Free™
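The training algorithm is not named here. The usual choice for fitting an HMM to observation sequences is Baum-Welch (expectation-maximization), so the sketch below assumes that. It performs re-estimation steps on a single sequence. The rating sequence and the starting parameters are made up for illustration and are not the market data.

```python
import math

def baum_welch_step(obs, start, trans, emit):
    """One Baum-Welch step on one sequence of observation indices.

    Returns (new_start, new_trans, new_emit, log_likelihood); the
    log-likelihood is that of the parameters passed in.
    """
    n, m, T = len(start), len(emit[0]), len(obs)

    # Scaled forward pass: each alpha[t] sums to 1, scale[t] keeps the factor.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [start[i] * emit[i][obs[0]] for i in range(n)]
        else:
            row = [sum(alpha[t - 1][i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                   for j in range(n)]
        c = sum(row)
        scale.append(c)
        alpha.append([x / c for x in row])

    # Backward pass, scaled with the same factors.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]

    # gamma[t][i]: probability of being in state i at time t given all observations.
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]

    new_trans, new_emit = [], []
    for i in range(n):
        visits = sum(gamma[t][i] for t in range(T - 1))
        # Expected i -> j transitions divided by expected visits to i.
        new_trans.append([
            sum(alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
                for t in range(T - 1)) / visits
            for j in range(n)])
        total = sum(gamma[t][i] for t in range(T))
        new_emit.append([sum(gamma[t][i] for t in range(T) if obs[t] == k) / total
                         for k in range(m)])
    return gamma[0], new_trans, new_emit, sum(math.log(c) for c in scale)

N = 5
start = [1.0 / N] * N
trans = [[0.6 if i == j else 0.1 for j in range(N)] for i in range(N)]  # assumed
emit = []
for s in range(N):
    w = [0.5 ** abs(s - r) for r in range(N)]  # assumed decay of 0.5
    emit.append([x / sum(w) for x in w])

# Hypothetical ratings for one app, in the order they were posted.
ratings = [5, 5, 4, 5, 3, 5, 5, 1, 5, 4, 4, 5, 2, 5, 5, 5, 4, 5, 5, 3]
obs = [r - 1 for r in ratings]

log_likelihoods = []
for _ in range(5):
    start, trans, emit, ll = baum_welch_step(obs, start, trans, emit)
    log_likelihoods.append(ll)
print([round(ll, 3) for ll in log_likelihoods])
```

Each step should not decrease the log-likelihood, which is a quick sanity check on the implementation. Training on several apps at once would need the expected counts summed over all sequences before normalizing, which this single-sequence sketch does not do.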

The final HMM is shown below. The results in summary:

1. Ratings 1 and 2 are quite rare
2. Apps with rating 3 have a tendency to fall to 1 and 2
3. Rating 4 is relatively unstable, as there is a 46.1% chance to remain and a 36.6% chance to move up
4. Rating 5 is quite stable, as the probability of remaining at the top is 90.3%
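One way to read the two self-transition probabilities above: in a Markov chain the number of consecutive steps spent in a state is geometrically distributed, so its mean is 1 / (1 - p_stay). The short calculation below uses only the two figures quoted in the summary. A step here is one transition of the chain, presumably one consecutive user rating.

```python
# Self-transition probabilities quoted in the summary.
stay = {4: 0.461, 5: 0.903}

# Mean length of an uninterrupted run in a state with self-transition p.
expected_run = {rating: 1.0 / (1.0 - p) for rating, p in stay.items()}

for rating, steps in expected_run.items():
    print(f"rating {rating}: about {steps:.1f} steps")
```

This gives roughly 1.9 steps for rating 4 and roughly 10.3 steps for rating 5, which puts a number on "relatively unstable" versus "quite stable".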

Applicability

The Android Market API is protected against excessive querying, so it is pretty hard to obtain a dataset large enough to investigate the model's applicability further. Still, the results look rather plausible.

Natural language processing of the comment text might improve the model further. That might be the topic of a later post.

Written by fxpaul

November 2, 2011 at 17:10

Posted in thoughts
