Using Time Clusters for Following Users’ Shifts in Rating Practices

Users that enter ratings for items follow different rating practices, in the sense that, when rating items, some users are more lenient, while others are stricter. This aspect is taken into account by the most widely used similarity metric in user-user collaborative filtering, namely, the Pearson Correlation, which adjusts each individual user rating by the mean value of the ratings entered by the specific user, when computing similarities. However, a user’s rating practices change over time, i.e. a user could start as strict and subsequently become lenient or vice versa. In that sense, the practice of using a single mean value for adjusting users’ ratings is inadequate, since it fails to follow such shifts in users’ rating practices, leading to decreased rating prediction accuracy. In this work, we address this issue by using the concept of dynamic averages introduced earlier and we extend earlier work by (1) introducing the concept of rating time clusters and (2) presenting a novel algorithm for calculating dynamic user averages and exploiting them in user-user collaborative filtering implementations. The proposed algorithm incorporates the aforementioned concept and is able to follow shifts in users’ rating practices more successfully. It has been evaluated using numerous datasets, and has been found to introduce significant gains in rating prediction accuracy, while outperforming the dynamic average computation approaches presented earlier.


Introduction
Collaborative filtering (CF), which is the most successful and most applied technique in the design of recommender systems [1], computes personalized recommendations, by taking into account the users' past likings and tastes, in the form of ratings entered in the CF rating database.
User-user CF algorithms first identify people having similar tastes, by examining the resemblance of already entered ratings; for each user u, other users having highly similar tastes with u are designated as u's nearest neighbors (NNs). Afterwards, in order to predict the rating that u would give to an item i that u has not reviewed yet, the ratings assigned to item i by u's NNs are combined [2], under the assumption that users are highly likely to exhibit similar tastes in the future, if they have done so in the past as well [3], [4]. To measure similarity between users, the Pearson correlation coefficient is the most commonly used formula in CF recommender systems. In this context, the Pearson correlation coefficient adjusts the ratings of a user u by the mean value of all ratings entered by u, so as to tackle the issue that some users may rate items either higher or lower than others. However, relying on a single mean value presumes that the users' marking practices remain constant over time; in practice though, a user's marking practices may change over time, i.e. a user could start off being lenient and subsequently change to being strict, or vice versa.
For instance, consider that a user watches 8 movies, one per day, and rates them from 4/10 to 7/10 (assuming a uniform distribution, starting with 4/10 and ending with 7/10); hence a movie that has been rated with 4/10 is considered as a relatively bad one. Afterwards, the user abstains from watching/rating movies for 5 days and then she watches 4 movies, one every two days, rating them from 7/10 to 10/10 (starting with 10/10 and ending with 7/10, this time). We can understand that the first movie of the second rating set, which has been rated with 10/10, is considered as an excellent one, but we cannot be certain how the first period movie that has been rated with 7/10 compares with the second period movie that has been rated with the same mark: it may be the case that after the user has watched the movie that she considered excellent and rated with 10/10, her standards have risen, therefore in reality the second period movie that was rated with 7/10 is actually better than the equally rated first period movie. It is also possible that the ratings entered by the user during the two periods were affected by a change in her mood [5]. In any case, the user's 5-day abstention signifies a change in her rating practices. The Pearson Correlation, which is predominantly used for computing user similarity in CF algorithms, does not consider such changes in rating practices, and therefore inaccuracies may be introduced in the user neighborhood computation process.
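The effect described in this example can be made concrete with a few lines of Python. This is only an illustrative sketch: the exact split of the first period into the marks 4, 4, 5, 5, 6, 6, 7, 7 and the day numbering are our assumptions, chosen to match the narrative above.

```python
from statistics import mean

# Assumed ratings for the example: (day, mark out of 10).
# First period: 8 movies on consecutive days, marks rising from 4 to 7.
first_period = list(zip(range(1, 9), [4, 4, 5, 5, 6, 6, 7, 7]))
# Second period: after a 5-day abstention, 4 movies, one every two days.
second_period = list(zip([14, 16, 18, 20], [10, 9, 8, 7]))

all_marks = [mark for _, mark in first_period + second_period]
global_mean = mean(all_marks)                      # single mean used by plain Pearson
first_mean = mean(m for _, m in first_period)      # mean of the first period only
second_mean = mean(m for _, m in second_period)    # mean of the second period only

# The same mark (7) yields different adjusted values depending on the mean used.
print("7 adjusted by global mean:       ", 7 - global_mean)
print("7 adjusted by first-period mean: ", 7 - first_mean)
print("7 adjusted by second-period mean:", 7 - second_mean)
```

Under these assumed marks, a single global mean treats both 7/10 ratings identically, whereas per-period means assign them deviations of opposite sign.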
So far, while many efforts have been made to improve the CF prediction accuracy, and the aspect of changes in users' interests has been extensively studied ([6] provides a comprehensive review), the issue of shifts in rating practices has not received adequate attention. Margaris and Vassilakis [7] introduce the concept of dynamic user rating averages, which follow the shifts in users' marking practices, and present two alternative algorithms, namely the DA_vicinity and DA_previous algorithms, for computing a user's dynamic averages. These algorithms are validated in the context of user-user CF, and have been found to achieve better rating prediction accuracy than the plain CF algorithm.
In this article, we extend the work in [7] as follows: (1) We introduce the concept of rating time clusters; a rating time cluster corresponds to a group of ratings which have some temporal cohesion and for which the user is assumed to follow the same rating practice. (2) We present a novel algorithm, namely DA_clusters, for computing rating time clusters. After formulating rating time clusters, DA_clusters computes one dynamic average per cluster; this dynamic average is then used in the rating prediction process. To validate our approach, we present an extensive evaluation, comparing the presented algorithm against the DA_previous and DA_vicinity algorithms proposed in [7] and the plain CF algorithm, which is used as a yardstick. Experiments have shown that using the per-cluster dynamic averages, which are computed by the DA_clusters algorithm, leads to more accurate predictions as compared to the per-rating dynamic averages that are computed by the DA_previous and DA_vicinity algorithms proposed in [7]. This improvement in accuracy indicates that the DA_clusters algorithm is able to follow shifts in rating practices more accurately. The DA_clusters algorithm also introduces significant space savings in comparison to the DA_previous and DA_vicinity algorithms [7].
The proposed algorithm, as well as the two algorithms presented in [7], are based on the exploitation of timestamp information which is associated with ratings; hence in this work, we use the Amazon datasets [8], [9], the MovieLens datasets [10], [11] and the Netflix dataset [12], which include the ratings' timestamps. It is worth noting that the proposed algorithm can be combined with other techniques that have been proposed either for improving prediction accuracy in CF-based systems, including consideration of social network data (e.g. [13], [14], [15]), location data [16], [17] and pruning of old user ratings [18], [19], or for speeding up prediction computation, such as clustering [20], [21], [22]. The rest of the article is structured as follows: Section 2 overviews related work, while Section 3 introduces the proposed algorithm and briefly describes the dynamic average-based algorithms presented in [7] for self-containment purposes. Section 4 evaluates the proposed algorithm using the aforementioned datasets and finally, Section 5 concludes the article and outlines future work.

Related Work
The accuracy of CF-based systems is a topic that has attracted considerable research efforts. Koren [23] proposes a new neighborhood-based model, which is based on formally optimizing a global cost function and leads to improved prediction accuracy, while maintaining merits of the neighborhood approach such as explainability of predictions and ability to handle new ratings (or new users) without retraining the model. In addition, it suggests a factorized version of the neighborhood model, which improves its computational complexity while retaining prediction accuracy. Liu et al. [24] present a new user similarity model to improve the recommendation performance when only few ratings are available to calculate the similarities for each user. The model considers the local context information of user ratings, as well as the global preference of user behavior. Ramezani et al. [25] propose a method to find the neighbor users based on the users' interest patterns, in order to overcome challenges like sparsity and computational issues. The method follows the idea that users who are interested in the same set of items share similar interest patterns; therefore, the non-redundant item subspaces are extracted to indicate the different patterns of interest, and then a user tree structure is created based on the patterns she has in common with the active user.
Research has shown that exploiting time in the rating prediction computation can improve prediction accuracy, due to concept drift; concept drift is the phenomenon in which the relation between the input data and the target variable changes over time [6]. Change of interests [26], [6] is a typical example of concept drift. To this end, Zliobaite et al. [27] develop an intelligent two-level approach for sales prediction, which uses a mechanism for switching predictors depending on the properties of the historical sales of a product, such as moving average sales, cumulative sales, holidays and seasonal sales. This approach is shown to achieve better results as compared to both (a) a baseline predictor and (b) an ensemble of predictors. Ang et al. [28] address the problem of adaptation when external changes are asynchronous, by developing an ensemble approach, called PINE, which combines reactive adaptation via drift detection, and proactive handling of upcoming changes via early warning and adaptation across the peers. In addition, PINE is parameter-insensitive and incurs less communication cost while achieving better accuracy. Elwell and Polikar [29] tackle the issue of concept drift in the context of online learning, introducing a batch-based ensemble of classifiers, called Learn++.NSE, where NSE stands for Non-Stationary Environments. Learn++.NSE learns from consecutive batches of data without making any assumptions on the nature or rate of drift; it can learn from environments that experience constant or variable rate of drift, addition or deletion of concept classes, as well as cyclical drift. The algorithm learns incrementally, as other members of the Learn++ family of algorithms, that is, without requiring access to previously seen data. Learn++.NSE trains one new classifier for each batch of data it receives, and combines these classifiers using dynamically weighted majority voting. The algorithm is evaluated on several synthetic datasets designed to simulate a variety of nonstationary environments, as well as a real-world weather prediction dataset. Minku et al. [30] present a new categorization for concept drift, separating drifts according to different criteria into mutually exclusive and non-heterogeneous categories. Moreover, they present a diversity analysis in the presence of different types of drifts, which shows that, before the drift, ensembles with less diversity obtain lower test errors. Nishida and Yamauchi [31] have developed a detection method that includes an online classifier and monitors its prediction errors during learning, using a statistical test of equal proportions. Experimental results showed that this method performed well in detecting the concept drift in five synthetic datasets that contained various types of concept drift.
Vaz et al. [32] propose an adaptation of the item-based CF algorithm to incorporate rating age influence in predictions. It considers ratings in two dimensions, the active user ratings and the community ratings, and introduces a time weight, which gives more relevance to more recent ratings than to older ones, both in the similarity calculation and in the rating prediction equation. Koenigstein et al. [33] consider the temporal dimension in the context of recommender systems by capturing different temporal dynamics of music ratings, along with information from the taxonomy of music-related items; both these dimensions are exploited by a rich bias model. The method proposed in this work is applied on a sparse, large-scale dataset, and the particular characteristics of the dataset are extracted and utilized. Liu et al. [34] present a social temporal collaborative ranking model that can simultaneously achieve three objectives: (1) it combines both explicit and implicit user feedback, (2) it supports time awareness using an expressive sequential matrix factorization model and a temporal smoothness regularization function to tackle overfitting, and (3) it supports social network awareness by incorporating a network regularization term. Dias and Fonseca [35] explore the usage of temporal context and session diversity in session-based CF techniques for music recommendation. They compare two techniques to capture the users' listening patterns over time: one explicitly extracts temporal properties and session diversity, to group and compare the similarity of sessions, while the other uses a generative topic modeling algorithm, which is able to implicitly model temporal patterns. Results reveal that the inclusion of temporal information, either explicitly or implicitly, significantly increases the accuracy of the recommendation, as compared to the traditional session-based CF.
Li et al. [36] study the problem of predicting the popularity of social multimedia content embedded in short microblog messages, exploiting the idea of concept drift to capture the phenomenon that, through the social networks' "re-share" feature, the popularity of a multimedia item may revive or evolve. They model the social multimedia item popularity prediction problem using a classification-based approach which is used for two sub-tasks, namely re-share classification and popularity score classification. Furthermore, they develop a concept drift-based popularity predictor by ensembling multiple classifiers trained on social multimedia instances from different time intervals.
Lu et al. [37] present a novel evolutionary view of user's profile by proposing a Collaborative Evolution (CE) model, which learns the evolution of user's profiles through the sparse historical data in recommender systems and outputs the prospective user profile of the future.
Kangasrääsiö et al. [38] formulate a Bayesian regression model for predicting the accuracy of each individual user feedback and thus find outliers in the feedback data set.Additionally, they introduce a timeline interface that visualizes the feedback history to the user and provides her with suggestions on which past feedback is likely in need of adjustment.This interface also allows the user to adjust the feedback accuracy inferences made by the model.The proposed modeling technique, combined with the timeline interface, makes it easier for the users to notice and correct mistakes in their feedback, and to discover new items.
However, none of the above-mentioned works considers the issue of shifts in the users' rating practices. This issue has only recently received some attention: Margaris and Vassilakis [7] introduce and exploit the concept of dynamic user rating averages, which follow the shifts in users' marking practices. Furthermore, they present two alternative algorithms, namely DA_vicinity and DA_previous, for computing a user's dynamic averages and perform a comparative evaluation in the context of a user-user CF implementation. The results of this evaluation show that the dynamic average-based algorithms exhibit better performance than the plain CF algorithm in terms of rating prediction accuracy, at the expense of a small to tolerable drop in coverage.
This article extends the work presented in [7] by (1) introducing a more successful dynamic average computation algorithm, based on user-level rating clusters, which is able to better follow the variations of user rating practices and (2) validating its performance against widely used datasets with diverse characteristics. The newly introduced algorithm has been found to provide more accurate rating predictions by better capturing the shifts in users' rating practices. It is worth noting that the proposed algorithm is agnostic to the reasons that have led to the shifts in users' rating practices, such as adoption of different standards or changes in mood: the proposed algorithm focuses on identifying periods with distinct rating practices, and not on analyzing the reasons behind these shifts.

Exploiting Ratings' Timestamps in Users Dynamic Average Configuration
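In CF, predictions for a user X are computed based on a set of users which have rated items similarly with X; this set of users is termed "near neighbors of X" (X's NNs). The similarity metric for ratings is typically based on the Pearson correlation metric, which is expressed as:
$$sim(X, Y) = \frac{\sum_{i} (r_{X,i} - \bar{r}_X)(r_{Y,i} - \bar{r}_Y)}{\sqrt{\sum_{i} (r_{X,i} - \bar{r}_X)^2}\,\sqrt{\sum_{i} (r_{Y,i} - \bar{r}_Y)^2}} \tag{1}$$
where i ranges over items that have been rated by both X and Y, $r_{X,i}$ (resp. $r_{Y,i}$) is the rating entered by X (resp. Y) for item i, and $\bar{r}_X$ (resp. $\bar{r}_Y$) is the mean value of all ratings entered by X (resp. Y). The algorithms presented in this section target the computation of $\bar{r}_X$ (resp. $\bar{r}_Y$), aiming to substitute the global average, which is insensitive to shifts in rating practices, by an average that is tailored to the time period that $r_{X,i}$ (resp. $r_{Y,i}$) was entered. When a dynamic average computation algorithm $DA_{Alg}$ is employed, the above formula is modified as:
$$sim_{DA_{Alg}}(X, Y) = \frac{\sum_{i} (r_{X,i} - DA_{Alg}(r_{X,i}))(r_{Y,i} - DA_{Alg}(r_{Y,i}))}{\sqrt{\sum_{i} (r_{X,i} - DA_{Alg}(r_{X,i}))^2}\,\sqrt{\sum_{i} (r_{Y,i} - DA_{Alg}(r_{Y,i}))^2}} \tag{2}$$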
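The difference between the plain and the dynamic average-aware Pearson similarity can be sketched as a single Python function that receives the averaging rule as a parameter. The function name, the dictionary-based data layout and the convention of returning 0 when the denominator vanishes are our assumptions for illustration; they are not prescribed by the text.

```python
from math import sqrt

def pearson_similarity(ratings_x, ratings_y, avg_x, avg_y):
    """Pearson similarity over co-rated items.

    ratings_x, ratings_y: dict mapping item -> rating for users X and Y.
    avg_x, avg_y: callables mapping an item to the average used to adjust
    that rating (a constant global mean for plain CF, or a per-rating
    dynamic average when a dynamic average algorithm is employed).
    """
    common_items = ratings_x.keys() & ratings_y.keys()
    numerator = sum_sq_x = sum_sq_y = 0.0
    for item in common_items:
        dx = ratings_x[item] - avg_x(item)
        dy = ratings_y[item] - avg_y(item)
        numerator += dx * dy
        sum_sq_x += dx * dx
        sum_sq_y += dy * dy
    denominator = sqrt(sum_sq_x) * sqrt(sum_sq_y)
    # Assumed convention: undefined similarity is reported as 0.
    return numerator / denominator if denominator else 0.0

# Plain CF: every rating of a user is adjusted by the same global mean.
x = {"a": 4, "b": 2, "c": 5}
y = {"a": 5, "b": 1, "c": 4}
mean_x = sum(x.values()) / len(x)
mean_y = sum(y.values()) / len(y)
plain_sim = pearson_similarity(x, y, lambda i: mean_x, lambda i: mean_y)

# Dynamic averages: each rating carries its own average (hypothetical values).
da_x = {"a": 3.0, "b": 3.0, "c": 4.5}
da_y = {"a": 4.0, "b": 2.0, "c": 4.5}
dynamic_sim = pearson_similarity(x, y, da_x.get, da_y.get)
```

Passing the averaging rule as a callable keeps the similarity computation identical in both cases, so that only the source of the averages changes.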

Existing Dynamic Average Algorithms
In [7], two algorithms for computing dynamic user averages were proposed:
• The dynamic average based on the temporal vicinity of the ratings, which will be denoted as DA_vicinity, follows a weighted average approach: for a rating r, each user rating r' posted by the same user is assigned a weight on the basis of its temporal vicinity to r (ratings that have been entered temporally close to r are assigned higher weights, and as temporal distance increases, the weights decrease), and finally the weighted average involving all ratings entered by the particular user is computed. The dynamic average for a rating $r_{u,i}$ on item i entered by user u is denoted as $DA_{vicinity}(r_{u,i})$ and is formally computed as
$$DA_{vicinity}(r_{u,i}) = \frac{\sum_{r \in R_u} w_{u,i}(r) \cdot r}{\sum_{r \in R_u} w_{u,i}(r)} \tag{3}$$
where $R_u$ is the set of ratings entered by user u and $w_{u,i}(r)$ is the weight of rating r with respect to its temporal vicinity to rating $r_{u,i}$, calculated using formula (4):
$$w_{u,i}(r) = 1 - \frac{|t(r_{u,i}) - t(r)|}{\max_{x \in R_u} t(x) - \min_{x \in R_u} t(x)} \tag{4}$$
In formula (4), t(x) denotes the timestamp of rating x, whereas $\max_{x \in R_u} t(x)$ and $\min_{x \in R_u} t(x)$ denote the maximum timestamp and minimum timestamp, respectively, among the ratings entered by user u; this weight computation formula follows the standard normalization function presented in [39].
• The dynamic average based only on previous ratings, which will be denoted as DA_previous, where again each user rating $r_{u,i}$ is coupled with its own average $DA_{previous}(r_{u,i})$; however, when computing this average, only ratings entered by the same user (u) prior to $r_{u,i}$ are taken into account. Formally, this is computed as
$$DA_{previous}(r_{u,i}) = \frac{\sum_{r \in R_u,\; t(r) < t(r_{u,i})} r}{\left|\{r \in R_u : t(r) < t(r_{u,i})\}\right|} \tag{5}$$
Results of [7] assert that both these approaches improve the CF predictions; however, they both suffer from increased storage requirements, since for each rating the dynamic average associated with the particular rating must be available in order to compute predictions.
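A minimal Python sketch of the two per-rating dynamic averages follows. It rests on two assumptions that should be checked against [7]: the vicinity weight is taken to be the min-max normalized temporal distance subtracted from 1 (our reading of formula (4)), and the first rating of a user, which has no predecessors, falls back to its own value in the DA_previous computation. The function names are ours.

```python
def da_vicinity(ratings, target_idx):
    """Weighted average of all ratings of one user, weighted by temporal
    closeness to the target rating.

    ratings: list of (timestamp, rating) pairs of a single user.
    Assumed weight: 1 - |t(target) - t(r)| / (t_max - t_min).
    """
    t_target = ratings[target_idx][0]
    timestamps = [t for t, _ in ratings]
    span = max(timestamps) - min(timestamps)
    if span == 0:
        # All ratings share one timestamp: fall back to the plain mean.
        return sum(r for _, r in ratings) / len(ratings)
    weights = [1 - abs(t - t_target) / span for t, _ in ratings]
    weighted_sum = sum(w * r for w, (_, r) in zip(weights, ratings))
    return weighted_sum / sum(weights)  # the target itself has weight 1

def da_previous(ratings, target_idx):
    """Plain average of the ratings entered strictly before the target."""
    t_target, target_rating = ratings[target_idx]
    earlier = [r for t, r in ratings if t < t_target]
    if not earlier:
        # Assumption: with no earlier ratings, use the rating itself.
        return target_rating
    return sum(earlier) / len(earlier)

user_ratings = [(0, 2), (10, 4), (20, 5)]
print(da_vicinity(user_ratings, 0), da_vicinity(user_ratings, 2))
print(da_previous(user_ratings, 2))
```

Both functions return one value per rating, which illustrates why these approaches need to store as many dynamic averages as there are ratings.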

The Proposed Algorithm
Under the proposed approach for computing dynamic averages, instead of computing and storing a separate average for each particular rating, the ratings of each user u are partitioned into clusters with respect to their timestamps, and a single dynamic average is computed and stored for each time cluster. This approach drastically reduces the space requirements for storing the dynamic averages.
In order to group the ratings of a user u into clusters, the proposed algorithm iterates over the ratings entered by u in ascending time order, following a greedy approach, so as to reduce the computational complexity of cluster formulation. More specifically, the algorithm initially considers a cluster including the first two ratings of the user and computes the average rating abstention interval between the elements of the cluster (which is initially equal to the difference of the timestamps of the two cluster elements). Subsequently, it examines if the next rating can be incorporated into the current cluster: if the abstention interval between the timestamp of the next rating and the timestamp of the latest rating within the cluster is less than the average abstention interval between consecutive ratings within the cluster, then the new rating is appended to the current cluster; in the opposite case, the current cluster is finalized and the clustering procedure is executed anew from the next rating onwards.
Effectively, this technique locates, for each user, a point in time when she has abstained from submitting ratings for a period of time which is longer than she had usually done recently (i.e. at that particular time period) and assumes that this abstention may signify a change in the user's rating strictness. It is worth noting that while a number of clustering algorithms exist (e.g. k-means, k-medoids, CLARA [40]), all these algorithms require that the number of clusters is known a priori, a condition that is not met by the user rating datasets. Various techniques are presented in the literature for computing the number of clusters that will deliver the optimal clustering (e.g. [41]); the investigation of these techniques and the use of different clustering algorithms will be part of our future work.
The proposed algorithm is illustrated below in pseudocode; besides formulating clusters, the pseudocode also computes the dynamic average for each cluster.
    // INPUT: rating database
    // OUTPUT: clusters array, containing for each user the computed set of clusters;
    //         each cluster contains the ratings and the respective dynamic average
    clusters = ∅
    FOREACH user u ∈ RatingsDB
        clusters[u] = ∅
        ru = retrieveAllUserRatings(u, RatingsDB)
        sort ru on timestamp with ascending order
        cluster_id = 1
        cluster_id_set = {ru[1], ru[2]}  // must include the first two ratings, in order to have an interval to compare with
        cluster_ratings_counter = 2
        cluster_ratings_sum = rating(ru[1]) + rating(ru[2])
        cluster_abstention_avg = timestamp(ru[2]) - timestamp(ru[1])
        FOR i = 3 TO count(ru)-1
            // we cannot have a single rating in a cluster; hence the last rating belongs by
            // default in the user's last cluster
            current_interval = timestamp(ru [...]
Due to the fact that the algorithm operates in a greedy fashion, it is prone to the formulation of an excessive number of clusters: if the first elements that are added to the cluster have small differences in their timestamps, then trivial clusters containing only two (or very few) ratings will be created. To ameliorate this effect, the constant FACTOR is used in the pseudocode, which adjusts the new ratings cluster detection threshold: setting FACTOR to values higher than 1 relaxes the criterion for incorporating each next rating into the current cluster, thus decreasing the probability that trivial clusters will be created. In our experiments, reported in Section 4, we explored different candidate values for the FACTOR parameter. More specifically, we used the following candidate FACTOR values: 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7.5 and 10. While, theoretically, FACTOR can be set to values lower than 1, this worsens the problem of trivial cluster creation, and therefore it is not considered in our evaluation.
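The clustering procedure described in this section can be sketched in Python for a single user as follows. This is an interpretation of the textual description rather than a transcription of the pseudocode: the function and variable names are ours, and three details are assumptions, namely that the average abstention interval is recomputed as (last timestamp - first timestamp) / (cluster size - 1), that a new cluster is seeded with the next two ratings, and that the threshold is FACTOR times the average abstention interval with a strict "less than" comparison.

```python
def time_clusters(ratings, factor=1.0):
    """Greedily partition one user's ratings into rating time clusters.

    ratings: list of (timestamp, rating) pairs of a single user.
    Returns a list of (cluster_ratings, dynamic_average) tuples.
    """
    ru = sorted(ratings, key=lambda pair: pair[0])
    if not ru:
        return []

    def finalize(cluster):
        # One dynamic average is stored per cluster.
        return cluster, sum(r for _, r in cluster) / len(cluster)

    clusters = []
    current = ru[:2]          # a cluster starts with (up to) two ratings
    i, n = 2, len(ru)
    while i < n:
        avg_interval = (current[-1][0] - current[0][0]) / (len(current) - 1)
        interval = ru[i][0] - current[-1][0]
        is_last = (i == n - 1)  # the last rating never forms its own cluster
        if is_last or interval < factor * avg_interval:
            current.append(ru[i])
            i += 1
        else:
            clusters.append(finalize(current))
            current = ru[i:i + 2]   # seed the next cluster with two ratings
            i += 2
    clusters.append(finalize(current))
    return clusters

# Hypothetical user: 8 daily ratings, a 5-day abstention, then 4 ratings
# entered every second day (timestamps expressed in days).
example = list(zip(range(1, 9), [4, 4, 5, 5, 6, 6, 7, 7])) + \
          list(zip([14, 16, 18, 20], [10, 9, 8, 7]))
for cluster, average in time_clusters(example, factor=2.0):
    print(len(cluster), "ratings, dynamic average", average)
```

With these assumptions, running the sketch on the hypothetical user with factor=1.0 splits the perfectly regular daily ratings into pairs, which is an instance of the trivial cluster problem discussed above, whereas factor=2.0 yields one cluster per rating period.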

Performance Evaluation
In this section, we report on our experiments through which we compared the proposed algorithm, DA_clusters, against the dynamic average-based algorithms introduced in [7], as well as the plain CF algorithm (which is used as a yardstick).
In this comparison we consider the following aspects:
1. Prediction accuracy; for this comparison, we used two well-established error metrics, namely the mean absolute error (MAE) metric, as well as the Root Mean Squared Error (RMSE) that 'punishes' big mistakes more severely. RMSE was used in the Netflix competition [12].
2. The coverage of the algorithm, i.e. the percentage of the cases for which a prediction can be computed [42].
3. The probability that an algorithm computes the correct user rating. Since user ratings are typically integer numbers, while predictions are calculated as real numbers, for comparing the prediction to the actual user rating we round the prediction to the nearest integer. This is analogous to the practice used in the Netflix Competition [12].
4. The number of dynamic averages the algorithm stores, in order to perform prediction computation [7].
To compute the MAE, the RMSE and the probability to compute the correct prediction, we employed the standard "hide one" technique [4]: each time, we hid a random rating in the database and then predicted its value based on the ratings of other non-hidden items. For each user, this procedure was executed for 10 randomly selected ratings entered by the particular user; therefore the computation of the MAE, the RMSE and the correct prediction probability was performed considering all users in the database.
The algorithms employing dynamic averages may exhibit different coverage, since the introduction of dynamic averages modifies the similarity metrics, and hence users that are deemed "similar" when using the plain CF algorithm (i.e. their standard Pearson similarity surpasses a threshold) may be deemed "not similar" when using the dynamic average-aware Pearson similarity, or vice versa. Under this condition, some users that are characterized as "grey sheep" [42] when using the plain CF algorithm (i.e. did not have enough near neighbors for a recommendation to be computed) may gain enough neighbors when using a dynamic average-based algorithm, thus increasing coverage; conversely, some users for which a recommendation was computed using the plain CF algorithm may become "grey sheep" when using a dynamic average-based algorithm because they lost some near neighbors, in which case coverage decreases.
For our experiments we used a machine equipped with six Intel Xeon E7-4830 @ 2.13GHz CPUs, 256GB of RAM and one 900GB HDD with a transfer rate of 200MBps, which hosted the datasets and ran the rating prediction algorithms.
In the following paragraphs, we report on our experiments regarding eight datasets. Four of these datasets are obtained from Amazon [8], [9], three from MovieLens [10], [11], and one from Netflix [12]. These eight datasets used in our experiments (a) contain reliable timestamps (most of the ratings within each dataset have been entered in real rating time and not in a batch mode), (b) are up to date (published between 1998 and 2016), (c) are widely used as benchmarking datasets in CF research and (d) vary with respect to type of dataset (movies, music, videogames and books) and size (from 2MB, up to 4.7GB). The basic properties of these datasets are summarized in Table 1.
In each dataset, users initially having fewer than 10 ratings were dropped, since users with few ratings are known to exhibit low accuracy in predictions computed for them [3]. This procedure did not affect the three MovieLens datasets and the Netflix dataset, because these four datasets contain only users that have rated 20 items or more. Furthermore, we detected cases where, for a particular user, all her ratings' timestamps were almost identical (i.e. the difference between the minimum and maximum timestamps was less than 30 seconds). These users were dropped as well, since this timestamp distribution indicated that the ratings were entered in a batch mode, hence the assigned timestamps are not representative of the actual time that these ratings were given by the users. In the following paragraphs, we report on our findings regarding the performance of the algorithm proposed in this work, versus the DA_previous and the DA_vicinity algorithms reported in [7], and the plain CF algorithm, which uses the standard Pearson correlation coefficient.
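For reference, the three accuracy-related measures listed above can be computed from a list of (actual rating, predicted rating) pairs as in the following sketch. The function name and the sample pairs are hypothetical; rounding is implemented as round-half-up, which is our assumption about how "nearest integer" is resolved for predictions ending in .5 (Python's built-in round would round such values to the nearest even integer instead).

```python
from math import floor, sqrt

def evaluate(pairs):
    """Return (MAE, RMSE, fraction of exactly correct predictions).

    pairs: list of (actual_rating, predicted_rating); actual ratings are
    integers, predictions are real numbers.
    """
    n = len(pairs)
    mae = sum(abs(actual - pred) for actual, pred in pairs) / n
    rmse = sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / n)
    # A prediction counts as correct when it rounds (half-up) to the rating.
    correct = sum(1 for actual, pred in pairs if floor(pred + 0.5) == actual) / n
    return mae, rmse, correct

# Hypothetical hidden ratings and the predictions computed for them.
sample = [(4, 3.6), (5, 4.0), (2, 2.4), (3, 3.5)]
mae, rmse, correct = evaluate(sample)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} correct={correct:.0%}")
```

In this sample the single large error (5 predicted as 4.0) weighs more heavily on the RMSE than on the MAE, which is the behavior referred to above as 'punishing' big mistakes.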
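The two user-filtering rules described above can be expressed as a simple predicate. The sketch below assumes that each user's ratings are available as (timestamp in seconds, rating) pairs; the function name, parameter names and sample data are ours.

```python
def keep_user(ratings, min_ratings=10, min_span_seconds=30):
    """Decide whether a user is retained for the experiments.

    ratings: list of (timestamp_in_seconds, rating) pairs of one user.
    A user is dropped when she has fewer than `min_ratings` ratings, or
    when all her ratings fall within less than `min_span_seconds`
    (which suggests batch entry rather than real rating time).
    """
    if len(ratings) < min_ratings:
        return False
    timestamps = [t for t, _ in ratings]
    return max(timestamps) - min(timestamps) >= min_span_seconds

# Hypothetical users: a regular rater, a sparse rater and a batch rater.
regular = [(day * 86400, 4) for day in range(12)]
sparse = [(day * 86400, 3) for day in range(5)]
batch = [(1000 + second, 5) for second in range(15)]
print(keep_user(regular), keep_user(sparse), keep_user(batch))
```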

The Amazon "Videogames" Dataset
For the Amazon Videogames dataset, when using the plain CF algorithm, predictions could be formulated for 72.15 % of the cases; in the rest of them, the respective users had no neighbors with a positive Pearson coefficient, i.e. no candidate recommenders, and therefore no prediction could be computed for them.
We can observe that under the DA_previous algorithm, which was the winner in the respective experiment presented in [7], this percentage drops by 4.29 %; on the other hand, the DA_previous algorithm improves the percentage of correct predictions by 0.76 %, and also reduces the MAE by 7.9 % and the RMSE by 5.9 %.
As far as the proposed algorithm is concerned, all tested variants (a variant corresponds to a particular setting of the FACTOR parameter) reduce the MAE by between 3.39 % (FACTOR = 10) and 11.9 % (FACTOR = 1) in comparison to the plain CF algorithm, at the expense of a coverage reduction, which ranges from 1.15 % (FACTOR = 10) to 5.87 % (FACTOR = 1.0), again in comparison to the plain CF algorithm. Works such as [18], [19] and [43] also assert that a tradeoff between coverage and accuracy exists, and in order to obtain a single measure for rating the suitability of each algorithm, the harmonic mean (HM) of these measures can be adopted; this is analogous to the goal of maximizing the HM of precision and recall (termed the F1 measure) in information retrieval [44]. Towards this direction, we adopt the following formula introduced in [43]:
$$HM(a) = \frac{2 \cdot normCov(a) \cdot normAcc(a)}{normCov(a) + normAcc(a)} \tag{6}$$
where $a \in Alg$ is a rating prediction algorithm and Alg is the set of all algorithms participating in the evaluation. For the DA_clusters algorithm, different settings for the FACTOR parameter are considered as different algorithms in the context of the evaluation. normCov(a) and normAcc(a) denote the normalized coverage and normalized accuracy, respectively, of algorithm a. These are computed according to the following formulas:
$$normCov(a) = \frac{coverage(a) - \min_{x \in Alg} coverage(x)}{\max_{x \in Alg} coverage(x) - \min_{x \in Alg} coverage(x)}$$
$$normAcc(a) = \frac{\max_{x \in Alg} MAE(x) - MAE(a)}{\max_{x \in Alg} MAE(x) - \min_{x \in Alg} MAE(x)}$$
The results obtained from the Amazon "Videogames" dataset are depicted in Table 2. For conciseness purposes, only the four best performing variants of the DA_clusters algorithm are reported, i.e. the four variants achieving the highest HM value. This practice is followed in the presentation of the results for all datasets.
Column % coverage corresponds to the percentage of cases for which the algorithm could compute predictions or, equivalently, the cases where the number of near neighbors computed using the algorithm's similarity metric was adequate [1], [2] to formulate a rating prediction. Columns MAE and RMSE illustrate the mean absolute error and the root mean square error, respectively, while column HM (coverage, accuracy) depicts the harmonic mean measure. Column Correct predictions % illustrates the percentage of the cases for which the algorithm managed to compute the exact rating given by the user. Finally, column avgs reduction % shows the reduction in storage space requirements against the dynamic average-based algorithms presented in [7], which is achieved by the DA_clusters algorithm, due to the fact that only one dynamic average is computed and stored per cluster, instead of computing one dynamic average per rating; this metric is computed only for the variants of DA_clusters.
In Table 2 we can observe that the highest harmonic mean is achieved by DA_clusters@2.0, i.e. the variant of DA_clusters where the FACTOR parameter has been set to 2.0. This variant achieves an overall reduction in the MAE equal to 9.2 % against the plain CF algorithm, surpassing the respective reduction attained by the DA_previous algorithm by 1.3 %. The DA_clusters algorithm also reduces the RMSE metric by an additional 2.01 % in comparison to the performance of the DA_previous algorithm, and at the same time increases both the coverage by 1.52 % and the percentage of correct predictions by 0.36 %.
Furthermore, when applying the proposed technique in this dataset, the dynamic averages stored are reduced by 78 % in comparison to the dynamic averages stored by the algorithms presented in [7], thus introducing significant space gains (recall that in both dynamic average-based algorithms presented in [7], every rating is coupled with its own particular average).
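The harmonic mean measure can be computed as in the following sketch, which assumes min-max normalization of coverage and of MAE over the set of compared algorithms (our reading of the normalization formulas referenced above). The plain CF and DA_clusters@1.0 figures are the minimum and maximum coverage and MAE values reported for the Amazon "Videogames" dataset; the third entry is a hypothetical algorithm added purely for illustration.

```python
def harmonic_mean_scores(results):
    """Compute HM(coverage, accuracy) for each algorithm.

    results: dict mapping algorithm name -> (coverage, MAE).
    Coverage is normalized so that higher is better; MAE is normalized
    so that a lower error gives a higher normalized accuracy.
    """
    coverages = [cov for cov, _ in results.values()]
    maes = [mae for _, mae in results.values()]
    cov_range = max(coverages) - min(coverages)
    mae_range = max(maes) - min(maes)
    scores = {}
    for name, (cov, mae) in results.items():
        norm_cov = (cov - min(coverages)) / cov_range if cov_range else 0.0
        norm_acc = (max(maes) - mae) / mae_range if mae_range else 0.0
        total = norm_cov + norm_acc
        scores[name] = 2 * norm_cov * norm_acc / total if total else 0.0
    return scores

results = {
    "plain CF": (72.15, 0.826),           # max coverage, max MAE
    "DA_clusters@1.0": (66.28, 0.727),    # min coverage, min MAE
    "hypothetical": (69.00, 0.760),       # illustrative values only
}
scores = harmonic_mean_scores(results)
for name, hm in scores.items():
    print(f"{name}: HM = {hm:.3f}")
```

Under this normalization, an algorithm that attains either the lowest coverage or the highest MAE among the compared algorithms receives an HM of 0, so the measure favors variants that balance the two criteria.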

The Amazon "CDs and Vinyl" Dataset
The results obtained from the Amazon "CDs and Vinyl" dataset are depicted in Table 3. Again, when using the plain CF algorithm, predictions could be formulated for 59.3 % of the cases; in the rest of the cases, the respective users had no neighbor with a positive Pearson coefficient, i.e. no candidate recommenders, and therefore no prediction could be computed for them.
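The link between positive Pearson coefficients and coverage can be sketched as follows. This is a minimal illustration of the standard mean-centered user-user prediction, not the exact implementation used in the experiments; the function names and the toy `ratings` dictionary are hypothetical, and no minimum number of neighbors or co-rated items is enforced.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(ru, rv):
    """Pearson similarity over co-rated items, centering each rating on
    the mean of all ratings entered by the respective user."""
    common = set(ru) & set(rv)
    mu, mv = mean(ru.values()), mean(rv.values())
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = math.sqrt(sum((ru[i] - mu) ** 2 for i in common)
                    * sum((rv[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0


def predict(user, item, ratings):
    """Return the predicted rating, or None when no neighbor qualifies."""
    ru = ratings[user]
    num = den = 0.0
    for other, rv in ratings.items():
        if other == user or item not in rv:
            continue
        sim = pearson(ru, rv)
        if sim > 0:  # only positively correlated users act as recommenders
            num += sim * (rv[item] - mean(rv.values()))
            den += sim
    return None if den == 0 else mean(ru.values()) + num / den


# Hypothetical toy data: user "w" disagrees with both "u" and "v".
ratings = {
    "u": {"a": 5, "b": 3, "c": 4},
    "v": {"a": 4, "b": 2, "c": 3, "d": 4},
    "w": {"a": 1, "b": 5, "d": 2},
}
covered = predict("u", "d", ratings)    # "v" is a positive neighbor
uncovered = predict("w", "c", ratings)  # no positive neighbor, hence None
```

Cases such as `uncovered` are the ones that count against coverage: the item has been rated by other users, but none of them is positively correlated with the target user.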
In comparison to the plain CF algorithm, the DA previous ratings algorithm, which was the winner in the respective test presented in [7], reduces the MAE by 6.35 % and the RMSE by 5.28 %, and increases the percentage of correct predictions by 0.33 %, at the expense of reducing coverage by 4.22 %. However, DA clusters @3.5, which achieves the best HM across all tested algorithms, further reduces the MAE by 0.68 % and the RMSE by 0.28 %, while at the same time increasing the coverage by 1.87 % and the percentage of correct predictions by 0.44 % (all differences are reported in comparison with DA previous). Space-wise, DA clusters @3.5 necessitates the storage of 86 % fewer dynamic averages than the dynamic average techniques presented in [7].
Table 2 note: min(MAE) = 0.727 (DA clusters @1.0); max(MAE) = 0.826 (plain CF); min(coverage) = 66.28 (DA clusters @1.0); max(coverage) = 72.15 (plain CF)
Table 3 note: min(MAE) = 0.657 (DA clusters @1.0); max(MAE) = 0.74 (plain CF); min(coverage) = 52.75 (DA clusters @1.0); max(coverage) = 59.30 (plain CF)
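The bounds quoted above (MAE 0.727 to 0.826, coverage 66.28 to 72.15) allow a quick check of how the reported percentages appear to be expressed. The sketch below assumes that MAE reductions are relative to the baseline value, while coverage reductions are differences in percentage points; this reading is an inference from the numbers, not a statement made in the text, and the helper names are hypothetical.

```python
def relative_reduction_pct(baseline, value):
    """Reduction expressed as a percentage of the baseline value."""
    return 100.0 * (baseline - value) / baseline


def point_difference(baseline, value):
    """Reduction expressed in percentage points."""
    return baseline - value


# Plain CF versus DA clusters @1.0, using the quoted bounds.
mae_drop = relative_reduction_pct(0.826, 0.727)  # roughly 12 %
cov_drop = point_difference(72.15, 66.28)        # roughly 5.87 points
```

These values are close to the 11.9 % MAE reduction and the 5.87 % coverage reduction reported for FACTOR = 1; the small gap in the MAE figure is plausibly due to the bounds being rounded to three decimals.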

The Amazon "Movies and TV" Dataset
The results obtained from the Amazon "Movies and TV" dataset are depicted in Table 4. When using the plain CF algorithm, predictions could be formulated for 78.50 % of the cases. With the DA previous ratings algorithm (which was the winner in the respective test presented in [7]) this percentage drops by 3.62 %, while the percentage of correct predictions increases by 0.88 %, the MAE decreases by 7.54 % and the RMSE drops by 5.80 %.
The proposed technique further increases prediction quality, while at the same time limiting the loss in coverage. More specifically, the DA clusters @2.0 variant (which achieves the highest HM) exhibits smaller MAE and RMSE than DA previous by 1.38 % and 2.02 %, respectively, while it also increases the percentage of correct predictions by 0.42 % and coverage by 1.35 %. It additionally reduces the number of dynamic averages that must be stored by 80 %.

The Amazon "Books" Dataset
The results obtained from the Amazon "Books" dataset, which is the largest Amazon dataset, are depicted in Table 5. When using the plain CF algorithm, predictions could be formulated for 53.76 % of the cases. With the DA previous ratings algorithm (which was the winner of the respective experiment reported in [7]), this percentage drops by 1.51 %, while the percentage of correct predictions increases by 0.36 %, the MAE is reduced by 2.2 % and the RMSE drops by 1.9 %.
In Table 5 we can observe that the variant DA clusters @2.0 is the one achieving the highest HM. In comparison to the DA previous ratings algorithm, the DA clusters @2.0 variant reduces the MAE and the RMSE by 5.71 % and 5.16 %, respectively, and increases the percentage of correct predictions by 0.28 %, while practically leaving coverage unaffected. Finally, it necessitates the storage of 84 % fewer dynamic averages in comparison to both algorithms introduced in [7] (DA previous and DA vicinity).
Table 5 note: min(MAE) = 0.551 (DA clusters @1.0); max(MAE) = 0.631 (plain CF); min(coverage) = 49.81 (DA clusters @1.0); max(coverage) = 53.76 (plain CF)

The MovieLens "Old 100K" Dataset
The results obtained from the MovieLens "Old 100K" dataset are depicted in Table 6. When using the plain CF algorithm, predictions could be formulated for 99.82 % of the cases (due to the dataset's high density). With the DA previous ratings algorithm (which was ranked first in the respective experiment presented in [7]), the coverage is practically not affected, while the percentage of correct predictions increases by 1.19 %, and the MAE and the RMSE fall by 3.47 % and 3.03 %, respectively.
In Table 6 we can observe that the variant DA clusters @1.75 is the one achieving the highest HM. In comparison to DA previous, the DA clusters @1.75 variant exhibits lower MAE and RMSE by 14.40 % and 13.36 %, respectively, while it increases the percentage of correct predictions by 7.44 % and maintains the same high coverage. Space-wise, the DA clusters @1.75 variant necessitates the storage of 76 % fewer dynamic averages than the dynamic average techniques presented in [7].
Table 6 note: min(MAE) = 0.599 (DA clusters @1.0); max(MAE) = 0.750 (plain CF); min(coverage) = 99.78 (DA clusters @5.0); max(coverage) = 99.84 (DA clusters @1.75)

The MovieLens "Latest-20M, Recommended for New Research" Dataset
The results obtained from the MovieLens "Latest-20M, Recommended for New Research" dataset are depicted in Table 7. Under the plain CF algorithm, predictions could be formulated for 99.96 % of the cases (again, due to the dataset's high density). With the DA previous ratings algorithm this percentage remains practically unaffected, dropping by 0.04 %, while the percentage of correct predictions increases by 1.48 %, the MAE is reduced by 3.26 % and the RMSE declines by 3.05 %.
In Table 7 we can observe that the variant DA clusters @1.75 is the one scoring the highest HM. In comparison to the DA previous algorithm, the DA clusters @1.75 variant reduces the MAE and the RMSE by 11.31 % and 9.71 %, respectively, and increases the percentage of correct predictions by 4.92 % with no change in coverage. Additionally, the DA clusters @1.75 variant reduces the requirements for storage of dynamic averages by 79 %.

The MovieLens "Latest 100K, Recommended for Education and Development (small)" Dataset
The results obtained from the MovieLens "Latest 100K, Recommended for Education and Development (small)" dataset are depicted in Table 8. When using the plain CF algorithm, predictions could be formulated for 99.57 % of the cases (again, due to the dataset's high density). With the DA previous ratings algorithm (which was the winner in the respective experiment reported in [7]) this percentage drops by 0.21 %, while the percentage of correct predictions increases by 0.12 %, the MAE is reduced by 3.36 % and the RMSE drops by 3.75 %.
As shown in Table 8, the DA clusters @1.5 variant is the one exhibiting the highest HM. As compared to the DA previous algorithm, the DA clusters @1.5 variant reduces the MAE and the RMSE by 13.15 % and 11.78 %, respectively, while it increases the correct prediction percentage by 5.38 % and achieves practically the same coverage. In terms of dynamic averages storage, the DA clusters @1.5 variant reduces the respective requirements by 78 %, in comparison to the dynamic average-based algorithms presented in [7].

The "NetFlix Competition" Dataset
The results obtained from the "NetFlix Competition" dataset are depicted in Table 9. Again, using the plain CF algorithm, predictions could be formulated for 99.12 % of the cases. In the respective experiments reported in [7], the DA vicinity algorithm was the winner, and our measurements verify that it surpasses the DA previous algorithm in this dataset. In comparison to the plain CF algorithm, the DA vicinity algorithm reduces the MAE and the RMSE by 4.87 % and 5.77 %, respectively, and achieves an increase of the correct predictions percentage by 4.18 %, at the expense of reducing coverage by 0.29 %.
Table 9 indicates that the variant DA clusters @1.75 exhibits the highest HM. In comparison to the DA vicinity algorithm, the DA clusters @1.75 variant decreases the MAE and the RMSE by 5.0 % and 2.43 %, respectively, while it also increases the percentage of correct predictions by 1.38 % and coverage by 0.14 %. Furthermore, the DA clusters @1.75 variant reduces the number of dynamic averages that must be stored by 90 %, in comparison to the dynamic average-based algorithms presented in [7].

Algorithm Comparison
Figure 1 depicts the improvement in MAE achieved by the dynamic average-based algorithms; for each dataset, the performance of the DA previous and DA vicinity algorithms is presented, together with the performance of the DA clusters variant that achieved the highest HM in the particular dataset. In all cases, the performance of the plain CF algorithm is used as a baseline.
The graph shows that the DA clusters algorithm introduced in this article achieves a mean improvement of 11.5 % against the plain CF algorithm across all datasets, ranging from 7.0 % to 17.9 %. This surpasses the performance of the DA previous and DA vicinity algorithms, whose average MAE reductions across all datasets are 4.7 % and 2.8 %, respectively. It is worth noting that the DA clusters algorithm is consistently ranked first in all eight datasets.
Figure 2 presents the respective improvements regarding the RMSE metric. In all cases, the DA clusters algorithm presented in this article is ranked first, scoring an improvement of 10.0 % on average, while the DA previous and DA vicinity algorithms proposed in [7] achieve corresponding improvements of 3.9 % and 2.8 %, respectively. Moreover, the improvements are very similar to those of the MAE metric shown in Figure 1, indicating that prediction improvements are spread uniformly among predictions with high and low errors (recall that the RMSE metric strongly penalizes predictions with high errors).
Figure 3 illustrates the improvements regarding the correct prediction percentage. The DA clusters algorithm is ranked first across all datasets, achieving improvements ranging from 0.6 % to 8.6 % against the baseline (plain CF) algorithm, with an average improvement of 3.7 %. We can observe that in the denser datasets (all MovieLens datasets and the Netflix dataset), the improvements in correct prediction percentage achieved by the DA clusters algorithm are more substantial. The DA clusters algorithm outperforms both algorithms proposed in [7], DA previous and DA vicinity, which achieve average improvements of 1.3 % and 0.9 %, respectively.
Figure 4 illustrates the reduction in coverage sustained when using the dynamic average-based algorithms, against the baseline algorithm (plain CF). In this case, the DA clusters algorithm is ranked second regarding its average performance, losing 1.2 % of coverage on average, with its coverage loss varying from -0.02 % (i.e. it achieves a marginal coverage increase in one dataset, namely the "MovieLens Old-100K" dataset) to 2.77 %. The best-performing algorithm regarding this metric is DA vicinity, for which the average coverage drop is 0.7 %, while the DA previous algorithm is ranked third, with an average coverage reduction of 1.8 %.
Finally, Figure 5 illustrates the reduction in storage needs for dynamic averages that is achieved by the DA clusters algorithm, as compared to the dynamic average-based algorithms proposed in [7]; since both of these algorithms require the same number of dynamic averages, only one set of measurements is presented in Figure 5. For each dataset, we consider the variant of DA clusters that achieves the highest HM. We can observe that the DA clusters algorithm substantially reduces the need for storing dynamic averages, with the respective gains ranging from 76 % to 90 %, with an average of 81 % across all datasets.
Summarizing, we can see that in all datasets the DA clusters algorithm is ranked first regarding the MAE reduction, the RMSE reduction and the improvement in the percentage of correct predictions. The only metric for which the DA clusters algorithm is ranked second is coverage, where it lags behind the DA vicinity algorithm; however, the DA clusters algorithm performs substantially better than DA vicinity regarding the MAE reduction, and therefore the DA clusters algorithm is clearly ranked first when jointly examining the coverage and accuracy metrics, as shown in the discussions presented in Sections 4.1-4.8. Finally, the DA clusters algorithm introduces significant savings in space requirements for storing the dynamic averages, necessitating the storage of 81 % fewer dynamic averages than the algorithms presented in [7].

Discussion
From the experiments above, it is clear that the DA clusters algorithm leads to more accurate predictions than the algorithms presented in [7], namely DA vicinity and DA previous. This is owing to the fact that the DA clusters algorithm computes local averages using only ratings that are temporally "close" together, where closeness is judged against the expected user rating frequency through the examination of rating abstention periods. On the contrary, when computing the dynamic average for some rating r, the DA previous algorithm takes into account all previous ratings, regardless of whether these are temporally close or distant to r. Consequently, the value of the dynamic average for r is affected by ratings that are temporally distant (where "distant" should be interpreted in relation to the nominal user rating frequency) and are thus highly likely to belong to periods during which the user used to follow different rating practices. Conversely, ratings that have been entered shortly after r (where "shortly" should again be interpreted in relation to the nominal user rating frequency), which should be grouped in the same rating practice period with r and be taken into account in the computation of the dynamic average of r, are disregarded. Notably, in dense datasets, where user rating frequencies are higher and more clusters are formulated per user, the gains in accuracy are considerably higher.
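The contrast between per-cluster averages and averages over all previous ratings can be illustrated with a small Python sketch. The splitting rule used here is an assumption made for illustration only: a new cluster starts when the gap between two consecutive ratings exceeds FACTOR times the user's mean gap between consecutive ratings. The function names are hypothetical, and `previous_averages` assumes that the rating itself is included in its own running average.

```python
from statistics import mean


def time_clusters(ratings, factor):
    """Split a user's ratings into temporally cohesive clusters.

    ratings: list of (timestamp, value) pairs sorted by timestamp.
    Assumed rule: start a new cluster when the gap to the previous rating
    exceeds factor * (mean gap between consecutive ratings).
    """
    if not ratings:
        return []
    gaps = [b[0] - a[0] for a, b in zip(ratings, ratings[1:])]
    threshold = factor * mean(gaps) if gaps else 0.0
    clusters = [[ratings[0]]]
    for prev, cur in zip(ratings, ratings[1:]):
        if cur[0] - prev[0] > threshold:
            clusters.append([])  # long abstention period: open a new cluster
        clusters[-1].append(cur)
    return clusters


def cluster_averages(ratings, factor):
    """One dynamic average per cluster."""
    return [mean(v for _, v in c) for c in time_clusters(ratings, factor)]


def previous_averages(ratings):
    """Running average over all ratings up to and including each rating."""
    out, total = [], 0.0
    for n, (_, value) in enumerate(ratings, start=1):
        total += value
        out.append(total / n)
    return out


# Hypothetical user: strict at first, lenient after a long abstention period.
ratings = [(1, 2), (2, 2), (3, 3), (30, 5), (31, 4), (32, 5)]
clusters = time_clusters(ratings, factor=2.0)
per_cluster = cluster_averages(ratings, factor=2.0)
running = previous_averages(ratings)
```

In this toy input the two clusters have averages of roughly 2.33 and 4.67, whereas the running average of the last rating is 3.5, still pulled down by the early strict period. Only two averages are stored instead of six, which mirrors the storage reduction discussed in the results.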
Considering the DA vicinity algorithm, when the dynamic average for some rating r is computed, all ratings are taken into account, albeit smaller weights are used for temporally distant ones. However, for periods where the user's rating frequency is low, this practice reduces the weight of ratings that should be grouped in the same rating practice period with r, introducing a source of inaccuracy in the dynamic average computation. Additionally, ratings that are temporally very distant from r are included in the computation of the dynamic average related to r, and these ratings effectively add noise to the computation of the dynamic average.
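A time-weighted average of this kind can be sketched as follows. The linear decay used for the weights is a hypothetical choice made for illustration; the actual weighting function of the DA vicinity algorithm is defined in [7] and may differ. The function name is also an assumption.

```python
def vicinity_average(ratings, index):
    """Time-weighted average of all of a user's ratings, centered on one rating.

    ratings: list of (timestamp, value) pairs sorted by timestamp.
    Assumed weight: 1 - |t - t_r| / (t_last - t_first), so the rating itself
    gets weight 1 and the temporally farthest possible rating gets weight 0.
    """
    t_r = ratings[index][0]
    span = ratings[-1][0] - ratings[0][0]
    num = den = 0.0
    for t, value in ratings:
        weight = 1.0 if span == 0 else 1.0 - abs(t - t_r) / span
        num += weight * value
        den += weight
    return num / den


# Same hypothetical user as before: strict early on, lenient later.
ratings = [(1, 2), (2, 2), (3, 3), (30, 5), (31, 4), (32, 5)]
early = vicinity_average(ratings, 0)
late = vicinity_average(ratings, 5)
```

Even with decaying weights, the distant ratings still contribute: the average around the first rating is about 2.40 rather than the 2.33 obtained from its own period alone, which illustrates the noise effect described above.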

Conclusion and Future Work
In this article we have introduced the concept of rating time clusters, i.e. groups of ratings that exhibit temporal cohesion and within which the user is assumed to follow the same rating practice. Furthermore, we have presented DA clusters, a novel algorithm which computes rating time clusters and one dynamic average per cluster; the computed dynamic averages are then utilized in the rating prediction process. The proposed algorithm has been experimentally evaluated using eight datasets and compared to the algorithms proposed in [7] (DA vicinity and DA previous), and has been found to consistently outperform both of them in terms of prediction accuracy and overall performance, i.e. performance when jointly considering the accuracy and coverage metrics. The improvement in rating prediction accuracy indicates that the DA clusters algorithm is able to follow shifts in users' rating practices more closely. In particular, the average MAE reduction compared to the DA previous and the DA vicinity algorithms is 6.7 % and 8.7 %, respectively, whereas the corresponding improvements regarding the RMSE metric are 6.1 % and 7.1 %. Considering the correct prediction metric, the proposed algorithm outperforms the DA previous and DA vicinity algorithms by 2.5 % and 2.8 %, respectively. Finally, the number of dynamic averages stored is reduced by 81 %, compared to the dynamic average-based algorithms presented in [7].
Our future work will focus on adapting the proposed approach for use with matrix factorization techniques (an alternative method to CF, used in recommender systems, based on matrix decomposition) [45], as well as comparing it with time-aware matrix factorization models.

Table 7 note: min(MAE) = 1.146 (DA clusters @1.0); max(MAE) = 1.379 (plain CF); min(coverage) = 99.9 (DA vicinity); max(coverage) = 99.96 (plain CF)

Figure 1. MAE improvement achieved by the DA clusters algorithm and the algorithms proposed in [7]

Figure 3. Correct predictions percentage improvement achieved by the DA clusters algorithm and the algorithms proposed in [7]

Figure 4. Predictions (cases) lost when applying the DA clusters algorithm and the algorithms proposed in [7] (less is better)

Figure 5. Storage needs reductions achieved by the DA clusters algorithm against the algorithms proposed in [7]

Table 3. Amazon "CDs and Vinyl" dataset results

Table 4. Amazon "Movies and TV" dataset results

Table 9. "NetFlix Competition" dataset results

In Figure 4, coverage loss is considerably higher in the Amazon datasets, which are sparser, while in the denser datasets (all MovieLens datasets and the Netflix dataset), coverage loss is negligible (less than 0.29 % in all cases).