Social Information Filtering:
Algorithms for Automating ``Word of Mouth''
Upendra Shardanand and Pattie Maes
MIT Media Lab
20 Ames Street, Rm. 305
Cambridge, MA 02139
shard@media.mit.edu, (617) 253-7441
pattie@media.mit.edu, (617) 253-7442
© ACM
ABSTRACT
This paper describes a technique for making personalized
recommendations from any type of database to a user based on
similarities between the interest profile of that user and those of
other users. In particular, we discuss the implementation of a
networked system called Ringo, which makes personalized
recommendations for music albums and artists. Ringo's database of
users and artists grows dynamically as more people use the system and
enter more information. Four different algorithms for making
recommendations by using social information filtering were tested and compared. We
present quantitative and qualitative results obtained from the
use of Ringo by more than 2000 people.
KEYWORDS
social information filtering, personalized recommendation systems, user modeling, information retrieval, intelligent systems, CSCW.
INTRODUCTION
Recent years have seen explosive growth in the sheer volume of information. The number of books, movies, news articles, advertisements, and, in particular, items of on-line information is staggering: considerably more than any person can possibly filter through in order to find the ones that he or she will like.
People handle this information overload through their own effort, the effort of others, and some blind luck. First of all, most items and information are removed from the stream simply because they are either inaccessible or invisible to the user. Second, a large amount of filtering is done for us: newspaper editors select which articles their readers want to read, and bookstores decide which books to carry. However, with the dawn of the electronic information age, this barrier will become less and less of a factor. Finally, we rely on friends and other people whose judgement we trust to make recommendations to us.
We need technology to help us wade through all the information to find
the items we really want and need, and to rid us of the things we do
not want to be bothered with. The common and obvious approach to the problem of information filtering is content-based filtering [1]. Keyword-based filtering and latent semantic indexing [2] are example content-based filtering techniques. Content-based filtering techniques recommend items for the user's consumption based on correlations between the content of the items and the user's preferences. For example, the system may try to correlate the presence of keywords in an article with the user's taste. However, content-based filtering has limitations:
- The items must either be machine-parsable, or attributes must have been assigned to them by hand.
- The technique has no inherent method for generating serendipitous finds: the user is only recommended items similar in content to those he or she has already indicated liking.
- It cannot judge the quality of an item; two items that use the same terms are indistinguishable, whatever their merit.
A complementary filtering technique is needed to address these issues.
This paper presents social information filtering, a general approach to personalized information filtering. Social information filtering essentially automates the process of ``word-of-mouth'' recommendations: items are recommended to a user based upon values assigned by other people with similar taste. The system determines which users have similar taste via standard formulas for computing statistical correlations.
Social information filtering overcomes some of the limitations of content-based filtering. Items being filtered need not be amenable to parsing by a computer. Furthermore, the system may recommend items to the user which are very different (content-wise) from what the user has indicated liking before. Finally, recommendations are based on the quality of items, rather than on the more objective properties of the items themselves.
This paper details the implementation of a social information filtering system
called Ringo, which makes personalized music recommendations to people
on the Internet. Results based on the use of this system by thousands
of actual users are presented. Various social information filtering
algorithms are described, analyzed and compared. These results
demonstrate the strength of social information filtering and its potential for
immediate application.
SOCIAL INFORMATION FILTERING
Social information filtering exploits similarities between the tastes of different users to recommend (or advise against) items. It relies on the fact that people's tastes are not randomly distributed: there are general trends and patterns within the taste of a person as well as between groups of people. Social information filtering automates the process of ``word-of-mouth'' recommendations. A significant difference is that instead of having to ask a couple of friends about a few items, a social information filtering system can consider thousands of other people and thousands of different items, all autonomously and automatically. The basic idea is:
- The system maintains a user profile, a
record of the user's interests (positive as well as negative) in
specific items.
- It compares this profile to the profiles of other users,
and weighs each profile for its degree of similarity with the user's
profile. The metric used to determine similarity can vary.
-
Finally, it considers a set of the most similar profiles,
and uses information contained in them to recommend (or advise
against) items to the user.
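The three steps above can be sketched concretely. The following is an illustrative skeleton only, not Ringo's implementation: profiles are dictionaries of ratings, and the similarity metric here (negative mean absolute difference over commonly rated items) is a stand-in for the metrics compared later in this paper.

```python
def similarity(a, b):
    """Placeholder metric: negative mean absolute rating difference
    over items rated by both users (higher = more similar)."""
    common = set(a) & set(b)
    if not common:
        return None
    return -sum(abs(a[i] - b[i]) for i in common) / len(common)

def recommend(user, others, item):
    """Steps 2 and 3: weigh every other profile by its similarity to
    `user`, then predict `item` from the most similar profiles."""
    scored = []
    for profile in others:
        s = similarity(user, profile)
        if s is not None and item in profile:
            scored.append((s, profile[item]))
    if not scored:
        return None
    top = sorted(scored, reverse=True)[:2]  # the most similar profiles
    return sum(rating for _, rating in top) / len(top)

# Step 1: each profile records the user's interest in specific items.
me = {"Beatles, The": 7, "Bee Gees": 1}
others = [
    {"Beatles, The": 7, "Bee Gees": 2, "Beach Boys, The": 6},
    {"Beatles, The": 6, "Bee Gees": 1, "Beach Boys, The": 6},
    {"Beatles, The": 1, "Bee Gees": 7, "Beach Boys, The": 2},
]
print(recommend(me, others, "Beach Boys, The"))  # 6.0
```

With these toy profiles the prediction is driven by the two profiles closest to the user's, both of whom rated the target item 6.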
RINGO
Ringo [1] is a social information filtering system which makes
personalized music recommendations. People describe their listening
pleasures to the system by rating some music. These ratings
constitute the person's profile. This profile changes over time
as the user rates more artists. Ringo uses these profiles to generate
advice to individual users. Ringo compares user profiles to determine
which users have similar taste (they like the same albums and dislike
the same albums). Once similar users have been identified, the system
can predict how much the user may like an album/artist that has not
yet been rated by computing a weighted average of all the ratings
given to that album by the other users that have similar taste.
Ringo is an on-line service accessed through electronic mail or the World Wide Web. Users may sign up with Ringo by sending e-mail to ringo@media.mit.edu with the word ``join'' in the body. The user then interacts with Ringo by sending commands and data to a central server via e-mail. Once an hour, the server processes all incoming messages and sends replies as necessary. Alternatively, users can try out Ringo via the World Wide Web (http://ringo.media.mit.edu).
When a user first accesses Ringo, he or she is presented a list of 125
artists. The user rates artists for how much they like to listen to
them. If the user is not familiar with an artist or does not have a strong opinion, the user is asked not to rate that item. Users are specifically advised to rate artists for how much they like to listen to them, not for any other criteria such as musical skill, originality, or other possible categories of judgment.
- 7 BOOM! One of my FAVORITE few!
Can't live without it.
- 6 Solid. They are up there.
- 5 Good Stuff.
- 4 Doesn't turn me on, doesn't bother me.
- 3 Eh. Not really my thing.
- 2 Barely tolerable.
- 1 Pass the earplugs.
Figure 1. Ringo's scale for rating music.
The scale for ratings varies from 1, ``pass the earplugs'', to 7, ``one of my favorite few, can't live without them'' (Figure 1). A seven-point scale was selected since studies have shown that the reliability of data collected in surveys does not increase substantially if the number of choices is increased beyond seven [6]. Ratings are not normalized because, as we expected, users rate albums in very different ways. For example, some users only give ratings to music they like (e.g. they only use 6's and 7's), while other users give bad as well as good ratings (1's as well as 7's). An absolute scale was employed, and descriptions were provided for each rating point to make it clear what each number means.
The list of artists sent to a user is selected in two parts. Part of
the list is generated from a list of the most often rated artists.
This ensures that a new user has the opportunity to rate artists which
others have also rated, so that there is some commonality in people's
profiles. The other part of the list is generated through a random
selection from the (open) database of artists. Thus, artists are
never left out of the loop. A user may also request a list of some
artist's albums, and rate that artist's albums on an individual basis.
The procedure for picking an initial list of artists for the user to
rate leaves room for future improvement and research, but has
been sufficient for our early tests.
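The two-part selection just described can be sketched as follows. The 50/50 split between most-rated and randomly drawn artists is an illustrative assumption, not Ringo's actual proportion.

```python
import random

def initial_survey(artists, rating_counts, n=125, popular_frac=0.5,
                   rng=None):
    """Build the survey list in two parts: the most often rated
    artists (common ground between profiles) plus a random selection
    from the open database (no artist left out of the loop)."""
    rng = rng or random.Random(0)
    by_popularity = sorted(artists,
                           key=lambda a: rating_counts.get(a, 0),
                           reverse=True)
    popular = by_popularity[:int(n * popular_frac)]
    rest = [a for a in artists if a not in popular]
    return popular + rng.sample(rest, min(n - len(popular), len(rest)))

artists = ["artist%d" % i for i in range(20)]
counts = {a: i for i, a in enumerate(artists)}  # artist19 most rated
survey = initial_survey(artists, counts, n=10)
print(len(survey), "artist19" in survey)  # 10 True
```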
Figure 2 shows part of one user's ratings of the initial 125 artists selected by Ringo.
- 6 ``10,000 Maniacs''
- 3 ``AC/DC''
- 3 ``Abdul, Paula''
- 2 ``Ace of Base''
- 1 ``Adams, Bryan''
-   ``Aerosmith''
-   ``Alpha Blondy''
- 6 ``Anderson, Laurie''
- 5 ``Arrested Development''
-   ``Autechre''
- 3 ``B-52s''
-   ``Babes in Toyland''
-   ``Be Bop Deluxe''
- 5 ``Beach Boys, The''
-   ``Beastie Boys''
- 4 ``Beat Happening''
- 7 ``Beatles, The''
- 1 ``Bee Gees''
Figure 2. Part of one person's survey.
Once a person's initial profile has been submitted, the user can ask
Ringo for predictions. Specifically, a person can ask Ringo to (1)
suggest new artists/albums that the user will enjoy, (2) list
artists/albums that the user will hate, and (3) make a prediction
about a specific artist/album. Ringo processes such a request using
its social information filtering algorithm, detailed in the next
section. It then presents the person with the result. Figure 3
provides an example of Ringo's suggestions. Every recommendation
includes a measure of confidence which depends on factors such as the
number of similar users used to make this prediction, the consistency
among those users' values, etc. (cf. [7] for details). Ringo's reply does not include any
information about the identity of the other users whose profiles were
used to make the recommendations.
Figure 3: Some of Ringo's suggestions.
Ringo provides a range of functions apart from making recommendations.
For example, when rating an artist or album, a person can also write a
short review, which Ringo stores. Two actual reviews entered by users
are shown in Figure 4. Notice that the authors of
these reviews are free to decide whether to sign these reviews or keep
them anonymous. When a user is told to try or to avoid an artist, any
reviews for that artist written by similar users are provided as well.
Thus, rather than one ``thumbs-up, thumbs-down'' review being given to
the entire audience, each user receives personalized reviews from
people that have similar taste.
Tori Amos has my vote for the best artist ever. Her lyrics and music
are very inspiring and thought provoking. Her music is perfect for
almost any mood. Her beautiful mastery of the piano comes from her
playing since she was two years old. But, her wonderful piano
arrangements are accompanied by her angelic yet seductive voice. If you
don't have either of her two albums, I would very strongly suggest
that you go, no better yet, run down and pick them up. They have been
a big part of my life and they can do the same for others. ---- user@place.edu
I'd rather dive into a pool of dull razor blades than listen to Yoko
put a band together with Linda McCartney. OK, I'm exaggerating. But her voice is *awful*. She ought to
put a band together with Linda McCartney. Two Beatles wives with
little musical talent.
Figure 4:
Two sample reviews written by users.
In addition, Ringo offers other miscellaneous features which increase
the appeal of the system. Users may add new artists and
albums into the database. This feature was responsible for the
growth of the database from 575 artists at inception to over 2500
artists in the first 6 weeks of use. Ringo, upon request, provides dossiers on any artist. The dossier includes a list of that artist's albums and straight averages of the scores given to that artist and to the artist's albums. It also includes any added history
about the artist, which can be submitted by any user. Users can also
view a ``Top 30'' and ``Bottom 30'' list of the most highly and most poorly
rated artists, on average. Finally, users can subscribe to a periodic
newsletter keeping them up to date on changes and developments in
Ringo.
ALGORITHMS AND QUANTITATIVE RESULTS
Ringo became available to the Internet public on July 1, 1994. The
service was originally advertised on only four specialized USENET
newsgroups. After a slow start, the number of people using Ringo grew
quickly. Word of the service spread rapidly as people told their
friends, or sent messages to mailing lists. Ringo reached the
1000-user mark in less than a month, and had 1900 users after 7 weeks.
At the time of this writing (September 1994) Ringo has 2100 users and
processes almost 500 messages a day.
Like the membership, the size of the database grew quickly.
Originally, Ringo had only 575 artists in its database. As we soon
discovered, users were eager to add artists and albums to
the system. At the time of this writing, there are over 3000 artists
and 9000 albums in Ringo's database.
Thanks to this overwhelming user interest, we have an enormous amount of data on which to test various social information filtering algorithms. This section discusses four algorithms that were evaluated and gives more details about the ``winning'' algorithm.
For our tests, the profiles of 1000 people were considered. A profile
is a sparse vector of the user's ratings for artists. 1,876 different
artists were represented in these profiles.
To test the different algorithms, 20% of the ratings in each person's profile were randomly removed. These ratings comprised the target set. The remaining 80% formed the source set. To evaluate each algorithm, we predicted a value
for each rating in the target set, using only the data in the source
set. Three such target sets and data sets were randomly created and
tested, to check for consistency in our results. For brevity, the
results from the first set are presented throughout this paper, as
results from all three sets only differed slightly.
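The split just described can be sketched as follows; the profile layout and the fixed seed are illustrative assumptions.

```python
import random

def split_profile(profile, frac=0.2, rng=None):
    """Randomly withhold `frac` of a profile's ratings as the target
    set; the remaining ratings form the source set."""
    rng = rng or random.Random(0)
    items = list(profile)
    rng.shuffle(items)
    cut = int(len(items) * frac)
    target = {i: profile[i] for i in items[:cut]}
    source = {i: profile[i] for i in items[cut:]}
    return source, target

profile = {"artist%d" % i: (i % 7) + 1 for i in range(10)}
source, target = split_profile(profile)
print(len(source), len(target))  # 8 2
```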
In the source set, each person rated on average 106 of the 1,876 possible artists. The median number of ratings was 75, and the most ratings by a single person was 772! The mean score of each profile, i.e. the average score a user gave across all rated artists, was 3.7.
Evaluation Criteria
The following criteria were used to evaluate each prediction scheme:
- The mean absolute error |E| of the predicted ratings should be minimized. The lower the mean absolute error, the more accurate the scheme. We cannot expect to lower |E| below the error inherent in people's own ratings of artists: if one provides the same list of artists to a person at different points in time, the resulting data collected will differ to some degree. The degree of this error has not yet been measured, but we would expect it to be at least ±1 unit on the rating scale, since ratings are given in whole units.
- The standard deviation σ of the errors should also be minimized. The lower the deviation, the more consistently accurate the scheme is.
- Finally, T, the percentage of target values for which the scheme is able to compute predictions, should be maximized. Some algorithms may not be able to make predictions in all cases.
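The three criteria can be computed with a small helper; the convention that a prediction of None marks a target the scheme could not predict is an illustrative assumption.

```python
def evaluate(predictions, targets):
    """Return the mean absolute error |E| of the errors E, their
    standard deviation sigma, and T, the fraction of target values
    that could be predicted (None = no prediction possible)."""
    errs = [predictions[i] - targets[i]
            for i in targets if predictions.get(i) is not None]
    if not errs:
        return None, None, 0.0
    mae = sum(abs(e) for e in errs) / len(errs)
    mean = sum(errs) / len(errs)
    sigma = (sum((e - mean) ** 2 for e in errs) / len(errs)) ** 0.5
    return mae, sigma, len(errs) / len(targets)

targets = {"a": 7, "b": 1, "c": 4}
predictions = {"a": 6, "b": 3, "c": None}  # "c" not predictable
mae, sigma, t = evaluate(predictions, targets)
print(mae, sigma, round(t, 2))  # 1.5 1.5 0.67
```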
Base Case Algorithm
A point of comparison is needed in order to measure the quality of
social information filtering schemes in general. As a base case, for each artist
in the target set, the mean score received by an artist in the source
set is used as the predicted score for that artist. A social information filtering
algorithm is neither personalized nor accurate unless it is a
significant improvement over this base case approach.
Figure 5 depicts the distribution of the errors E: |E| is 1.3, and the standard deviation σ is 1.6.
The distribution has a nice bell curve shape about 0, which is what was
desired. At first glance, it may seem that this mindless scheme does
not behave too poorly. However, let us now restrict our examination to the extreme target values, where the score is 6 or greater or 2 or less. These values, after all, are the critical points.
Users are most interested in suggestions of items they would love or
hate, not of items about which they would be ambivalent.
Figure 5: The distribution of errors in predictions of the Base
Algorithm.
The distribution of errors for extreme values is shown by the dark gray bars in Figure 5. The mean error and standard deviation worsen considerably, with |E| = 1.8 and σ = 2.0. Note the lack of the desired bell curve shape; the distribution is in fact the sum of two bell curves. The right hill is mainly the errors for those target values which are 2 or less; the left hill is mainly the errors for those target values which are 6 or greater. For the target values 6 or greater, the mean absolute error is much worse, with |E| = 2.1.
Why the great discrepancy in error characteristics between all values and only extreme values? Analysis of the database indicates that the mean score for each artist converges to approximately 4. Therefore, this scheme performs well in cases where the target value is near 4. However, for the areas of primary interest to users, the base algorithm is useless.
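The base case amounts to nothing more than a per-artist mean over the source set; a minimal sketch (artist names illustrative):

```python
def base_case(source_profiles):
    """Predict, for every artist, the mean of all source-set ratings
    that artist received -- the same value for every user."""
    totals = {}
    for profile in source_profiles:
        for artist, rating in profile.items():
            total = totals.setdefault(artist, [0, 0])
            total[0] += rating
            total[1] += 1
    return {artist: s / n for artist, (s, n) in totals.items()}

profiles = [{"A": 7, "B": 2}, {"A": 3, "B": 2}, {"A": 5}]
print(base_case(profiles))  # {'A': 5.0, 'B': 2.0}
```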
Social Information Filtering Algorithms
Four different social information filtering algorithms were evaluated. Due to
space limitations, the algorithms are described here briefly. Exact
mathematical descriptions as well as more detailed analysis of the
algorithms can be found in [7].
The Mean Squared Differences Algorithm. The first algorithm measures the degree of dissimilarity between two user profiles, Ux and Uy, by the mean squared difference between the two profiles:

$\Delta(U_x, U_y) = \frac{1}{|A_{xy}|} \sum_{a \in A_{xy}} (U_x(a) - U_y(a))^2$

where $A_{xy}$ is the set of artists rated in both profiles. Predictions can then be made by considering all users with a dissimilarity to the user which is less than a certain threshold L and computing a weighted average of the ratings provided by these most similar users, where the weights are inversely proportional to the dissimilarity.
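A sketch of this scheme follows; the threshold value and the exact form of the inverse weighting (adding one keeps the weight finite for identical profiles) are assumptions, with the precise formulation in [7].

```python
def msd(ux, uy):
    """Mean squared difference over artists rated by both users;
    lower means more similar."""
    common = set(ux) & set(uy)
    if not common:
        return None
    return sum((ux[a] - uy[a]) ** 2 for a in common) / len(common)

def predict_msd(user, others, artist, L=4.0):
    """Weighted average over users whose dissimilarity is below the
    threshold L, weighted inversely to the dissimilarity."""
    num = den = 0.0
    for other in others:
        d = msd(user, other)
        if d is None or d >= L or artist not in other:
            continue
        w = 1.0 / (1.0 + d)  # +1 keeps the weight finite when d == 0
        num += w * other[artist]
        den += w
    return num / den if den else None

me = {"A": 7, "B": 1}
others = [{"A": 7, "B": 1, "C": 6}, {"A": 1, "B": 7, "C": 2}]
print(predict_msd(me, others, "C"))  # 6.0
```

The second profile's dissimilarity (36) exceeds the threshold, so only the matching profile contributes.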
The Pearson r Algorithm. An alternative approach is to use the standard Pearson r correlation coefficient to measure similarity between user profiles:

$r_{xy} = \frac{\sum_{a} (U_x(a) - \bar{U}_x)(U_y(a) - \bar{U}_y)}{\sqrt{\sum_{a} (U_x(a) - \bar{U}_x)^2} \sqrt{\sum_{a} (U_y(a) - \bar{U}_y)^2}}$

where the sums run over the artists rated in both profiles. This coefficient ranges from -1, indicating a negative correlation, via 0, indicating no correlation, to +1, indicating a positive correlation between two users. Again, predictions can be made by computing a weighted average of other users' ratings, where the Pearson r coefficients are used as the weights. In contrast with the previous algorithm, this algorithm makes use of negative correlations as well as positive correlations to make predictions.
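A sketch of the coefficient itself; computing it over commonly rated artists only, with each mean taken over those artists, is an assumption about details the paper leaves to [7].

```python
def pearson(ux, uy):
    """Pearson r over the artists rated by both users: -1 (opposite
    tastes) through 0 (no correlation) to +1 (matching tastes)."""
    common = set(ux) & set(uy)
    if len(common) < 2:
        return None
    mx = sum(ux[a] for a in common) / len(common)
    my = sum(uy[a] for a in common) / len(common)
    num = sum((ux[a] - mx) * (uy[a] - my) for a in common)
    den = (sum((ux[a] - mx) ** 2 for a in common)
           * sum((uy[a] - my) ** 2 for a in common)) ** 0.5
    return num / den if den else 0.0

alike = pearson({"A": 7, "B": 1, "C": 5}, {"A": 6, "B": 2, "C": 5})
opposed = pearson({"A": 7, "B": 1, "C": 5}, {"A": 1, "B": 7, "C": 4})
print(alike > 0.9, opposed < -0.9)  # True True
```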
The Constrained Pearson r Algorithm. Close inspection of the Pearson r algorithm and the coefficients it produced prompted us to test a variant which takes the positivity and negativity of ratings into account. Since the scale of ratings is absolute, we ``know'' that values below 4 are negative, while values above 4 are positive. We modified the Pearson r scheme so that the correlation coefficient increases only when both people have rated an artist positively (above 4) or both negatively (below 4). More specifically, the standard Pearson r equation was altered to become:

$r_{xy} = \frac{\sum_{a} (U_x(a) - 4)(U_y(a) - 4)}{\sqrt{\sum_{a} (U_x(a) - 4)^2} \sqrt{\sum_{a} (U_y(a) - 4)^2}}$
To produce recommendations to a user, the constrained Pearson r algorithm first computes the correlation coefficient between the user and all other users. Then all users whose coefficient is greater than a certain threshold L are identified. Finally a weighted average of the ratings of those similar users is computed, where the weight is proportional to the coefficient. This algorithm does not make use of negative ``correlations'' as the Pearson r algorithm does. Analysis of the constrained Pearson r coefficients showed that there are few very negative coefficients, so including them makes little difference.
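A sketch of the constrained variant and its thresholded prediction step: the scale midpoint 4 replaces the user means, so only shared positive (above 4) or shared negative (below 4) ratings raise the coefficient.

```python
MID = 4  # midpoint of the absolute 1-7 rating scale

def constrained_pearson(ux, uy):
    """Pearson r with the scale midpoint substituted for the means."""
    common = set(ux) & set(uy)
    if not common:
        return None
    num = sum((ux[a] - MID) * (uy[a] - MID) for a in common)
    den = (sum((ux[a] - MID) ** 2 for a in common)
           * sum((uy[a] - MID) ** 2 for a in common)) ** 0.5
    return num / den if den else 0.0

def predict(user, others, artist, L=0.6):
    """Weighted average of similar users' ratings; only users whose
    coefficient is greater than the threshold L contribute, with the
    coefficient itself as the weight."""
    num = den = 0.0
    for other in others:
        r = constrained_pearson(user, other)
        if r is None or r <= L or artist not in other:
            continue
        num += r * other[artist]
        den += r
    return num / den if den else None

me = {"A": 7, "B": 1}
others = [{"A": 6, "B": 2, "C": 7}, {"A": 2, "B": 6, "C": 1}]
print(predict(me, others, "C"))  # 7.0
```

The second user correlates perfectly negatively and so, unlike in the plain Pearson scheme, contributes nothing.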
The Artist-Artist Algorithm. The preceding algorithms deal with measuring and employing similarities between users. Alternatively, one can employ correlations between artists or albums to generate predictions. The idea is simply an inversion of the previous three methodologies. Say Ringo needs to predict how a user, Murray, will like ``Harry Connick, Jr''. Ringo examines the artists that Murray has already rated and weighs each one with respect to its degree of correlation with ``Harry Connick, Jr''. The predicted rating is then simply a weighted average of the ratings of the artists that Murray has already scored. An implementation of such a scheme using the constrained Pearson r correlation coefficient was evaluated.
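The inversion can be sketched by transposing the database to artist -> {user: rating} and correlating artists over the users who rated both. The reuse of the constrained Pearson r coefficient follows the text, while the toy data and the positive-correlation cutoff are illustrative assumptions.

```python
MID = 4  # midpoint of the 1-7 rating scale

def artist_sim(ratings_a, ratings_b):
    """Constrained Pearson r between two artists, computed over the
    users who rated both."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return None
    num = sum((ratings_a[u] - MID) * (ratings_b[u] - MID) for u in common)
    den = (sum((ratings_a[u] - MID) ** 2 for u in common)
           * sum((ratings_b[u] - MID) ** 2 for u in common)) ** 0.5
    return num / den if den else 0.0

def predict_for(user, by_artist, target):
    """Weighted average of the user's own ratings, each weighted by
    its artist's correlation with the target artist."""
    num = den = 0.0
    for artist, rating in user.items():
        s = artist_sim(by_artist.get(artist, {}), by_artist[target])
        if s is None or s <= 0:  # keep positively correlated artists only
            continue
        num += s * rating
        den += s
    return num / den if den else None

# The database transposed: artist -> {user: rating}.
by_artist = {
    "Harry Connick, Jr": {"u1": 7, "u2": 1},
    "Sinatra, Frank":    {"u1": 6, "u2": 2},
    "Slayer":            {"u1": 1, "u2": 7},
}
murray = {"Sinatra, Frank": 7, "Slayer": 1}
print(predict_for(murray, by_artist, "Harry Connick, Jr"))  # 7.0
```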
Results
A summary of some of our results (for different values of the threshold L) is presented in Table 1; more details can be found in [7]. Overall, the constrained Pearson r algorithm performed the best on our dataset if we take into account the error as well as the percentage of target values that can be predicted. The mean squared differences and artist-artist algorithms may perform slightly better in terms of the quality of the predictions made, but they are not able to produce as many predictions.
Table 1: Summary of Results.
As expected, there is a tradeoff between the average
error of the predictions and the percentage of target values that can
be predicted. This tradeoff is controlled by the parameter
L, the
minimum degree of similarity between users that is required for one
user to influence the recommendations made to another.
Figure 6 illustrates the distribution of errors for the best algorithm with the threshold L equal to 0.6. The distribution for extreme values approaches a bell curve, as desired. The statistics are |E| = 1.1, σ = 1.4 for all values and |E| = 1.2, σ = 1.6 for extreme values. These results are excellent, especially as the mean absolute error for extreme values approaches that of all values. At this threshold level, 91% of the target set is predictable.
QUALITATIVE RESULTS
Ultimately, what is more important than the numbers in the previous
section is the human response to this new technology. As of this
writing over 2000 people have used Ringo. Our source for a qualitative
judgment of Ringo is the users themselves. The Ringo system operators
have received a staggering amount of mail from users: questions, comments, and bug reports. The results described in
this section are all based on user feedback and observed use
patterns.
One observation is that a social information filtering system becomes more
competent as the number of users in the system increases.
Figure 7 illustrates how the error in a recommendation relates to the number of user profiles consulted to make the recommendation. As the number of user scores used to generate a prediction increases, the deviation in error decreases significantly. This is the case because the more people use the system, the greater the chances are of finding close matches for any particular user. The system may need to reach a certain critical mass of collected data before it becomes useful. Ringo's competence develops over time, as more people use the system.
Understandably then, in the first couple of weeks of Ringo's life, Ringo was relatively incompetent. During those days we received many messages letting us know how poorly Ringo performed. Slowly, the feedback changed. More and more often we received mail about how ``unnervingly accurate'' Ringo was, and less about how it was incorrect. Ringo's growing group of regular ``customers'' indicates that it is now at a point where the majority of people find the service useful.
Figure 7: Deviation in prediction error as a function of the number of user profiles consulted.
However, many people are disappointed by Ringo's initial
performance. We are often told that a person must do one or two
iterations of rating artists before Ringo becomes accurate. A user
would rate the initial set, then receive predictions. If the user
knows any of the predicted artists are not representative of their
personal taste, they rate those artists. This will radically alter
the members of the user's ``similar user'' neighborhood. After
these iterations, Ringo works satisfactorily. This indicates that what is needed is a better algorithm for determining the ``critical'' artists to be rated by the user so as to distinguish the user's tastes and narrow down the group of similar users.
Beyond the recommendations, there are other factors which are
responsible for Ringo's great appeal and phenomenal growth. The
additional features, such as being a user-grown database, and the
provisions for reviews and dossiers add to its functionality.
Foremost, however, is the fact that Ringo is not a static system. The database and user base are continually growing, and as they do, Ringo's recommendations to the user change. For this reason, people enjoy Ringo and use it on a regular basis.
RELATED WORK
Several other attempts have been made at building filtering services
that rely on patterns among multiple users. The Tapestry system
[3] makes it possible to request Netnews documents that have been
approved by other users. However, users must themselves know who
these similar people are and specifically request documents annotated
by those people. That is, using the Tapestry system the user still
needs to know which other people may have similar tastes. Thus, the
social information filtering is still left to the user.
During the development of Ringo, we learned about the existence of
similar projects in a similar state of development. One such example
is GroupLens [4], a system applying social information filtering to the personalized selection of Netnews. GroupLens employs Pearson r correlation coefficients to determine similarity between users. On our dataset, the algorithms described in this paper performed better than the algorithm used by GroupLens.
Two other recently developed systems are a video recommendation
service implemented at Bellcore, Morristown, NJ and a movie
recommendation system developed at ICSI, Berkeley, CA. Unfortunately,
as of this writing, there is no information available about the
algorithms used in these systems, nor about the results obtained.
The user modeling community has spawned a range of recommendation
systems which use information about a user to assign that user to one
of a finite set of hand-built, predefined user classes or
stereotypes. Based on the stereotype the user belongs to, the system
then makes recommendations to the user. For example, [5] recommends novels to users based on a stereotype classification. This method is far less personalized than the social information filtering method described in this paper, because in social information filtering every user in a sense defines a stereotype that another user can to some degree belong to; the number of stereotypes used to define a user's taste is thus much larger.
Finally, some commercial software packages exist that make
recommendations to users. An example is Movie Select, a movie
recommendation software package by Paramount Interactive Inc. One
important difference is that these systems use a data set that does
not change over time. Furthermore, these systems also do not record
any history of a person's past use. As far as can be deduced from the
software manuals and brochures, these systems store correlations
between different items and use those correlations to make
recommendations. As such the recommendations made are less
personalized than in social information filtering systems.
CONCLUSIONS
Experimental results obtained with the Ringo system have demonstrated
that social information filtering methods can be used to make
personalized recommendations to users. Ringo has been tested and used
in a real-world application and received a positive response. The
techniques employed by the system could potentially be used to
recommend books, movies, news articles, products, and more.
More work needs to be done in order to make social information filtering
applicable when dealing with very large user groups and a less narrow
domain. Work is currently under way to speed up the algorithm by the
use of clustering techniques, so as to reduce the number of similarity
measures that need to be computed. We are also using clustering
techniques among the artist data so as to identify emergent musical
genres and make use of these distinct genres in the prediction
algorithms.
Finally, we have not even begun to explore the very interesting and controversial social and economic implications of social information filtering systems like Ringo.
ACKNOWLEDGMENTS
Carl Feynman proposed the mean squared differences algorithm and
initially developed some of the ideas that led to this work. Karl
Sims proposed the artist-artist algorithm. Lee Zamir and Max Metral
implemented the WWW version of Ringo and are currently responsible for
maintaining the Ringo system. Alan Wexelblat and Max Metral provided
useful comments on an earlier version of the paper. This work was
sponsored by the National Science Foundation under grant number
IRI-92056688 and by the ``News in the Future'' Consortium of the MIT
Media Lab.
REFERENCES
- [1] Special Issue on Information Filtering, Communications of the ACM, Vol. 35, No. 12, December 1992.
- [2] Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, Richard Harshman, ``Indexing by Latent Semantic Analysis'', Journal of the American Society for Information Science, Vol. 41, No. 1, 1990, pp. 391-407.
- [3] David Goldberg, David Nichols, Brian M. Oki and Douglas Terry, ``Using Collaborative Filtering to Weave an Information Tapestry'', Communications of the ACM, Vol. 35, No. 12, December 1992, pp. 61-70.
- [4] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl, ``GroupLens: An Open Architecture for Collaborative Filtering of Netnews'', to be published in the Proceedings of the CSCW 1994 conference, October 1994.
- [5] Elaine Rich, ``User Modeling via Stereotypes'', Cognitive Science, Vol. 3, pp. 335-366, 1979.
- [6] Robert Rosenthal and Ralph Rosnow, ``Essentials of Behavioral Research: Methods and Data Analysis'', McGraw-Hill, second edition, 1991.
- [7] Upendra Shardanand, ``Social Information Filtering for Music Recommendation'', MIT EECS M.Eng. thesis, also TR-94-04, Learning and Common Sense Group, MIT Media Laboratory, 1994.