AnalyTeam, 2022
In this project, we soft-cluster¹ the users of BeerAdvocate into a set of predefined categories. Based on that, we analyze how attractive the website is to each category over time. We also uncover the trends of these categories and build user personas for the website’s administrators. In the end, we draw a few conclusions about what BeerAdvocate does well and what could be improved to make the website more attractive to the beer-loving community. More generally, we conclude that natural soft-clustering of users can help uncover the behaviors of user groups, which leaves room for improvement in many businesses, including those that rely heavily on recommender systems and custom user experiences. Our dataset of BeerAdvocate data spans 1996 to 2017.
¹ Natural soft-clustering: the approach is “cluster together users that satisfy a human-interpretable condition” rather than “cluster together users that are similar according to a similarity metric”. Since a user can satisfy several conditions at once, or none at all, a user can belong to any number of categories, including zero. That is what we mean by natural soft-clustering of users.
We use the dataset provided by the teaching team of “CS-401: Applied Data Analysis” at EPFL for the year 2022.
We use a score-based approach to soft-cluster users, with scores that are easy for humans to interpret. For every score, a larger value is stronger evidence that the user belongs to the corresponding category.
Conformist (CFM): a user who rates beers close to their average rating, meaning that on average this user deviates little from the average opinion on the beers they rate.
This category may be regarded as an indicator of the herding effect where users tend, on average, to stick to the average opinion when rating beers even if the average opinion may be unfair towards a beer.
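As an illustration, the conformist score can be sketched as the negated mean absolute deviation between a user's ratings and the average rating of each beer. This is only a sketch under assumptions: the `(user_id, beer_id, rating)` tuple layout and the function name are hypothetical, and whether the beer average includes the user's own rating is not stated here (this version includes it). The negation is chosen so that larger scores mean "more conformist".

```python
from collections import defaultdict
from statistics import mean

def conformist_scores(ratings):
    """ratings: list of (user_id, beer_id, rating) tuples (hypothetical layout).

    Returns {user_id: score}, where score is minus the user's mean absolute
    deviation from the average rating of the beers they rated.
    """
    # Average rating of each beer over all users (assumption: the user's
    # own rating is included in that average).
    by_beer = defaultdict(list)
    for _, beer, rating in ratings:
        by_beer[beer].append(rating)
    beer_avg = {beer: mean(values) for beer, values in by_beer.items()}

    # Absolute deviation of every rating from its beer's average.
    deviations = defaultdict(list)
    for user, beer, rating in ratings:
        deviations[user].append(abs(rating - beer_avg[beer]))

    # Negate so that a larger score means a smaller deviation.
    return {user: -mean(devs) for user, devs in deviations.items()}
```

With this sign convention, a score of -0.25 means the user deviates by 0.25 on average from the average opinion.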
EXP user (EXP): a user who rates beers close to the BA score that is displayed on the website for most beers. According to the BeerAdvocate administrators, the BA score is a reference score displayed on the website to give users an idea of how a beer ranks among beers of the same style.
Note that 94.5% of rated beers in the data available to us have a BA score.
For lack of a better name, we refer to these users as EXP users. The name points to a somewhat “structured” way of rating beers, which may indicate that the user is influenced by the BA score displayed on the website.
Explorer (XPL): a user who rates beers that have only a few ratings, or none at all. In the latter case, we consider that user to be the one who added the beer to the website. (Yes, BeerAdvocate allows any user to add new beers to its database.)
These users, therefore, shed light (positively or negatively) on unpopular beers by rating them, and some of them help populate the website with new beers which enriches the user experience on the website.
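A minimal sketch of how such an explorer score could be computed is the fraction of a user's rated beers for which the user is among the first raters. The tuple layout `(user_id, beer_id, timestamp)`, the function name, and the assumption of at most one rating per user and beer are hypothetical; the default of 10 first raters follows the interpretation used for the explorer threshold in this study.

```python
from collections import defaultdict

def explorer_scores(ratings, first_n=10):
    """ratings: list of (user_id, beer_id, timestamp) tuples (hypothetical layout).

    Returns {user_id: fraction of the user's rated beers for which the user
    is among the first `first_n` raters}.
    """
    # Group the ratings of each beer so they can be ordered in time.
    by_beer = defaultdict(list)
    for user, beer, timestamp in ratings:
        by_beer[beer].append((timestamp, user))

    early = defaultdict(int)   # beers rated while they had fewer than first_n ratings
    total = defaultdict(int)   # all beers rated by the user
    for entries in by_beer.values():
        entries.sort()  # chronological order
        for rank, (_, user) in enumerate(entries):
            total[user] += 1
            if rank < first_n:
                early[user] += 1

    return {user: early[user] / total[user] for user in total}
```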
Adventurer (ADV): a user who often rates beers whose rating is, at best, slightly okay.
These users are willing to risk trying out bad beers. We are interested in such users because they may contribute to the visibility of underrated beers on the website and so indirectly to the user experience.
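The adventurer score can be sketched in the same way, as the fraction of a user's ratings that go to beers whose average rating, at the moment of rating, is at most some cutoff. The cutoff value for “slightly okay at best” is not given in this text, so the default of 3.25 below is purely a placeholder. The tuple layout, the function name, and the choice not to count beers with no earlier rating as hits are also assumptions.

```python
from collections import defaultdict

def adventurer_scores(ratings, cutoff=3.25):
    """ratings: list of (timestamp, user_id, beer_id, rating) tuples (hypothetical layout).

    Returns {user_id: fraction of the user's ratings given to beers whose
    average rating so far was <= cutoff}. The cutoff is a placeholder value.
    """
    running = defaultdict(lambda: [0.0, 0])  # beer -> [sum, count] of earlier ratings
    hits = defaultdict(int)
    total = defaultdict(int)

    # Process ratings chronologically so each beer's average reflects only
    # the ratings given before the current one.
    for _, user, beer, rating in sorted(ratings):
        rating_sum, count = running[beer]
        total[user] += 1
        # Assumption: a beer with no earlier rating is not counted as a hit.
        if count > 0 and rating_sum / count <= cutoff:
            hits[user] += 1
        running[beer][0] += rating
        running[beer][1] += 1

    return {user: hits[user] / total[user] for user in total}
```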
To choose adequate thresholds for classifying users based on the scores defined in the previous section, we look at the distributions of these scores. The figures below plot those distributions on linear and logarithmic scales:
The adventurer (ADV) score has a heavy-tailed distribution. By the interpretability of this score, a threshold of 0.2 classifies as adventurers the users for whom at least 20% of the beers they rate have, at the moment they rate them, a rating that is slightly okay at best. We use this threshold, which corresponds to the $90^{th}$ percentile of the ADV score distribution.
The conformist score also has a heavy-tailed distribution. Here again, we choose the threshold that corresponds to the $90^{th}$ percentile, which is -0.35. By the interpretability of this score, we classify as conformists the users who deviate on average by less than 0.35 from the average opinion.
The EXP score looks like a skewed Gaussian, but it is not one, since it is heavy-tailed. Again by the interpretability of this score, a threshold of -0.2 classifies as EXP users those who deviate on average by less than 0.2 from the displayed BA score.
The explorer score also follows a heavy-tailed distribution. By the interpretability of this score, a threshold of approximately 0.2 classifies as explorers the users who are among the first 10 raters of at least 20% of the beers they rate.
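The percentile-based thresholding described above can be sketched as follows. This is an illustration only: it uses the nearest-rank definition of a percentile, which may differ slightly from the definition used in the analysis, and the function names and the `{user_id: score}` layout are hypothetical.

```python
import math

def percentile_threshold(scores, pct=90):
    """Nearest-rank percentile of the values in {user_id: score}."""
    ordered = sorted(scores.values())
    # Nearest-rank method: smallest value with at least pct% of the data at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def classify(scores, threshold):
    """Users whose score reaches the threshold belong to the category."""
    return {user for user, score in scores.items() if score >= threshold}
```

Because every score is built so that larger means "more typical of the category", the same `>=` rule applies to all four categories, including the negative conformist and EXP scores.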
Since we do soft clustering, every user has a score for each category. But how many users meet the threshold for several categories at once? We use a simple Venn diagram to visualize the overlap between categories. The diagram shows the number of users in the categories CFM, ADV, and XPL, and the number of users at the intersection of several categories.
As we can see, there are very few overlaps between explorers, conformists, and adventurers. There are some users with two categories, but almost no users with all 3 categories. This suggests that the categories that we defined target different kinds of users.
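The counts behind such a Venn diagram can be obtained by grouping users by the exact set of categories they belong to. The sketch below assumes each category is available as a set of user ids; the function name is hypothetical.

```python
from collections import Counter

def venn_counts(categories):
    """categories: {category_name: set of user ids}.

    Returns a Counter mapping each exact combination of categories
    (as a frozenset of names) to the number of users in that region.
    """
    membership = {}
    for name, users in categories.items():
        for user in users:
            membership.setdefault(user, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())
```

For example, `venn_counts(cats)[frozenset({"CFM", "ADV", "XPL"})]` gives the number of users in all three categories; regions with no users simply count as zero.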
In the following sections, we study the categories defined previously on several levels (beer style preferences, locations, ratings, reviews, etc.) to extract as much information about them as possible. The goal is to build relevant personas that the administrators can use to improve the UX of their website.
In this section, we highlight the rating and review tendencies of the four categories: XPL, ADV, CFM, and EXP.
To find out whether the number of ratings characterizes each category in some way, we plot the likelihood that a user belongs to a category as a function of the range in which their number of ratings falls. Each range is an inter-quantile range covering around 25% of the distribution of the number of ratings over all studied users. As a quick reminder, those are the users from English-speaking countries with at least 5 ratings. The figure below shows the result.
Note that we only consider 99% of the distribution. The remaining 1% contains only super users, who have many times more ratings than normal users. Since there are very few of them, their results may not be conclusive or representative of super users overall, so we treat them as outliers and discard them here.
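The binning procedure described above can be sketched as follows: sort users by number of ratings, drop the most active 1%, split the rest into four equally sized groups, and compute the share of each group that belongs to a category. The function name and input layout are hypothetical, and users with equal rating counts may fall on either side of a bin boundary in this simplified version.

```python
def category_share_by_activity(n_ratings, members, n_bins=4, keep_pct=99):
    """n_ratings: {user_id: number of ratings}; members: set of user ids in one category.

    Returns a list of ((min_count, max_count), share) pairs, one per bin,
    where share is the fraction of the bin's users that are in `members`.
    """
    ordered = sorted(n_ratings, key=n_ratings.get)
    # Discard the most active users (treated as outliers).
    kept = ordered[: len(ordered) * keep_pct // 100]

    shares = []
    for b in range(n_bins):
        lo = b * len(kept) // n_bins
        hi = (b + 1) * len(kept) // n_bins
        chunk = kept[lo:hi]
        if not chunk:
            continue  # too few users to fill this bin
        count_range = (n_ratings[chunk[0]], n_ratings[chunk[-1]])
        share = sum(user in members for user in chunk) / len(chunk)
        shares.append((count_range, share))
    return shares
```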
We notice the following trends:
We can therefore deduce the following:
In the same spirit, we filter the users having at least one review and repeat the process. The figure below shows the result:
This time, we highlight only the most important trend: explorers (XPL) are overall the most likely to provide a review for the beers they rate.
For our analysis, we only kept users from the USA, Canada, the UK, and Australia. Note that most BeerAdvocate users come from the USA. Is the distribution of countries the same for all categories? The following interactive plot (you can select a category) shows the distribution of countries for all users (ALL), conformists (CFM), explorers (XPL), adventurers (ADV), and EXP users (EXP).
We can conclude from the plot that conformists (CFM) and EXP users (EXP) have the same country distribution as the distribution of all users regardless of categories.
However, we see a different distribution for explorers (XPL) and adventurers (ADV), with a higher proportion of users from these categories coming from Canada, England, and Australia. This could potentially be due to a tendency for these users to rate beers from their own country or region more often. Indeed, since Canadian, English and Australian beers are rarer in BeerAdvocate, if the users from these countries rate the beers from their country more, they are more likely to rate beers with few or bad ratings.
The pie chart only shows the distribution of categories at the country level. However, we need to zoom into the USA to get the full picture, since that is where most BeerAdvocate users come from. To get a clearer look at the disparities per location, we decompose the USA into its states. The following plot shows the distribution of categories across US states and countries.
Clearly, the percentage of users in each category varies a lot by location. For CFM and EXP, the per-location percentage stays between one and two times the overall value (10% and 3% of users, respectively). The XPL and ADV categories show larger disparities, with Australia, England, and Canada having extremely high percentages: 60% of Australian users are explorers, which is more than 4 times the overall percentage of 13%. The concentration of XPL in these countries can be explained by the extremely low number of selected users there: Australia (229), Canada (1593), and England (329). These numbers are so low that when a user rates a local beer, there is a high chance that this beer has been rated fewer than 10 times so far, which classifies the user as XPL. To explain the ADV spike in these countries, we can look at the average rating of the beers that their users rate. Australia is the country whose users rate the beers with the lowest average rating (3.64), England is second (3.70), and Canada is fourth (3.75). Australian, English, and Canadian users therefore tend to rate beers with a low average rating, which is why they are more often classified as ADV.
Some regions of the world have a strong reputation for brewing beer. We first analyze where the rated beers were brewed, that is, where the users’ ratings go. The following interactive plot (you can click on categories) shows, for each category, the percentage of ratings that go to each location (US states and countries), ranked by highest percentage first.
We can see a similar pattern for all categories: they rate beers from where they come from a lot. Indeed, we see that most ratings go to the most populous states of the US. However, by selecting adventurer (ADV) and overall (ALL) only on the plot, we see that adventurers, while still rating beers from California a lot, have a different behavior than the general rater. By selecting only adventurer (ADV) we see that they also rate beers from other countries a lot, such as beers from Canada, Germany, Belgium, and England for example.
Canada and England may not be surprising, since a lot of adventurers come from these two countries. The other countries are more surprising, since there are no users from Belgium or Germany among the users we considered. Beers from Belgium, which are very famous, have a high percentage of ratings across all categories. However, 5% of the ratings from adventurers go to beers brewed in Germany, while only 2% of the ratings from users in general do. It seems that adventurers are more likely than the other categories to rate beers from other countries.
This behavior can be compared to explorers who also rate a lot of beers coming from Canada (we see that many explorers are Canadian), but they don’t rate beer from Germany as often as adventurers do.
We see that adventurers (ADV) don’t rate the same kinds of beers as the other user categories. However, do they have similar tastes in terms of beer provenance? The following interactive plot (you can click on a category) shows, for each location, the ratio between the average rating that a category of users gives to beers from that location and the overall average rating of that category. If the ratio for a location is above 1 (the dashed line), users from the selected category like beers from that location more than the other beers they rated.
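The ratio plotted here can be sketched as below, for the ratings of a single user category. The `(location, rating)` layout and the function name are hypothetical, and the category average is taken over all of the category's ratings, which is one possible reading of the definition above.

```python
from collections import defaultdict
from statistics import mean

def location_preference(ratings):
    """ratings: list of (beer_location, rating) pairs from ONE user category.

    Returns {location: mean rating for that location / mean rating overall}.
    A value above 1 means the category rates beers from that location
    higher than its own average.
    """
    overall = mean(rating for _, rating in ratings)
    by_location = defaultdict(list)
    for location, rating in ratings:
        by_location[location].append(rating)
    return {loc: mean(values) / overall for loc, values in by_location.items()}
```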
By selecting all categories, we see that conformists (CFM), EXP users, and explorers (XPL) mostly agree on the best beer provenances. Moreover, they all have similar likings compared to users in general (ALL). However, adventurers have different preferences. By clicking on adventurers (ADV) and overall (ALL) on the plot, we see, for example, that they like beers from Belgium more compared to the general users (ALL). There is also a huge difference in beers from Scotland. Overall, they have very different opinions compared to the other categories.
We have concluded that conformists (CFM), EXP users, and explorers (XPL) behave similarly when rating beers by location of origin, and that adventurers (ADV) behave differently. We now check whether this conclusion also applies to beer styles, which are also a very opinionated topic. The following interactive plot (you can click on categories) is similar to the first plot for the location of origin of beers. It displays, for each category, the percentage of ratings that go to each beer style, ranked by highest percentage first.