TPGR: Large-scale Interactive Recommendation with Tree-structured Policy Gradient

Nov. 2018

Researchers from the recommendation and search team at Huawei Noah's Ark Lab published a research paper at AAAI 2019 (AAAI Conference on Artificial Intelligence), in collaboration with Prof. Yong Yu and Prof. Weinan Zhang from the Apex Data & Knowledge Management Lab at Shanghai Jiaotong University. In the paper, we study the problem of applying deep reinforcement learning to recommendation scenarios in which the action space is large and discrete. We devise a reinforcement learning model, Tree-structured Policy Gradient Recommendation (TPGR). To handle the large discrete action space and achieve high recommendation effectiveness, we build a balanced hierarchical clustering tree over the candidate items and then utilize the policy gradient technique to learn the strategy of choosing the optimal subclass at each non-leaf node of the constructed tree. Compared with other RL models, TPGR achieves better accuracy and larger reward more efficiently. This research work is publicly available at
Interactive recommender systems (IRS) play a key role in most personalized services on mobile devices. Unlike conventional recommendation methods such as content-based filtering and collaborative filtering, where the recommendation process is modeled as a static one, an IRS consecutively recommends items to individual users and receives their feedback so as to refine its recommendation policy during this interactive process. To handle the interactive nature, some efforts have been made to model the recommendation process as a contextual multi-armed bandit (MAB) problem. However, these works assume that the underlying user preference remains unchanged during the recommendation process and do not explicitly plan for long-run performance.
Recently, reinforcement learning (RL), which has achieved remarkable success in challenging scenarios that require both dynamic interaction and long-run planning, such as playing games and regulating ad bidding, has been introduced to model the recommendation process and shows its potential to handle the interactive nature of IRS. An RL model is the agent's policy for acting, learned during the interaction between the agent and the environment. When applying RL to IRS, the agent is the recommender system and the environment is the set of all users.
The recommendation process can be modeled by a Markov Decision Process (MDP), defined as follows.
  • State: a state s is defined as the historical interactions between a user and the recommender system, which can be encoded as a low-dimensional vector via a deep neural network such as MLP or RNN.
  • Action: an action a is to pick an item for recommendation, such as a song or a video.
  • Reward: all users interacting with the recommender system form the environment that returns a reward r after receiving an action a at the state s, which reflects the user’s feedback to the recommended item.
  • Transition: as the state is the historical interactions, once a new item is recommended and the corresponding user’s feedback is given, the state transition is determined.
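The MDP above can be sketched as a toy environment. The class below is a minimal illustration only: the class name, the hidden preference vector, and the 0/1 reward rule are all invented for the example, not taken from the paper, where rewards come from real (or simulated) user feedback.

```python
import random

class ToyIRSEnv:
    """Toy interactive-recommendation environment mirroring the MDP above.

    State: the history of (item, feedback) pairs; action: an item id;
    reward: simulated user feedback; transition: append the new
    interaction to the history (deterministic, as described above).
    """

    def __init__(self, n_items=8, seed=0):
        rng = random.Random(seed)
        # Hidden per-item user preference, used only to simulate feedback.
        self.pref = [rng.random() for _ in range(n_items)]
        self.history = []  # the state: past interactions

    def step(self, item):
        # Simulated binary feedback in place of a real user's response.
        reward = 1.0 if self.pref[item] > 0.5 else 0.0
        self.history.append((item, reward))  # deterministic state transition
        return list(self.history), reward
```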

    Reinforcement Learning and Recommender Systems

    Most RL-based models become unacceptably inefficient for IRS with a large discrete action space, as the time complexity of making a decision is linear in the size of the action space. In DQN-based solutions [1, 2], a value function is learned that estimates the expected discounted cumulative reward of taking an action in a given state, and a decision is made by choosing the action with the largest Q-value. Thus, to make a decision, all candidate items have to be evaluated, which makes both learning and deployment intractable when the action space is large, as is common in IRS. A similar problem exists in most DDPG-based solutions [3, 4], where ranking parameters are learned and a specific ranking function is applied over all items to pick the one with the highest ranking score. The complexity of choosing an item for these methods therefore also grows linearly with the size of the action space.
    In this work, we propose the Tree-structured Policy Gradient Recommendation (TPGR) framework to achieve both high efficiency and high effectiveness. In the TPGR framework, a balanced hierarchical clustering tree is built over the items, and picking an item is thus formulated as seeking a path from the root to a certain leaf of the tree, which dramatically reduces the time complexity in both the training and decision-making stages. We utilize the policy gradient technique to learn how to make recommendation decisions so as to maximize long-run rewards. To the best of our knowledge, this is the first work to build a tree-structured stochastic policy for large-scale interactive recommendation.
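The complexity gap can be made concrete with a back-of-the-envelope count: a flat policy scores every item per decision, while a tree policy of branching factor c evaluates only c child scores at each level of the root-to-leaf path. The function names below are ours, purely for illustration:

```python
def flat_decision_cost(n_items):
    # DQN/DDPG-style decision: one Q-value or ranking score per candidate item.
    return n_items

def tree_decision_cost(n_items, c):
    # Depth of a balanced tree with branching factor c covering n_items leaves.
    depth, capacity = 0, 1
    while capacity < n_items:
        capacity *= c
        depth += 1
    return c * depth  # c child scores evaluated at each level of the path
```

With one million items and branching factor 100, the flat policy scores 1,000,000 candidates per decision while the tree policy scores only 100 × 3 = 300.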
    As the first step of TPGR, we need to build a balanced hierarchical clustering tree. A clustering tree is balanced if, for each node, the heights of its subtrees differ by at most one and the subtrees are themselves balanced. It is also required that each non-leaf node has the same number of child nodes, denoted c, except for parents of leaf nodes, whose numbers of child nodes are at most c. To construct such a tree, we perform balanced hierarchical clustering over the items using a clustering algorithm that takes a group of vectors and an integer c as input and divides the vectors into c balanced clusters (i.e., the numbers of items in the clusters differ by at most one). By repeatedly applying the clustering algorithm until each sub-cluster contains only one item, a balanced clustering tree is constructed.
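The recursive construction can be sketched as follows. This is a simplification: a real implementation would cluster on learned item representations (e.g. with a balanced variant of k-means), whereas here we simply chunk an ordered list, which is enough to exhibit the balance property described above.

```python
def split_balanced(items, c):
    # Divide items into at most c groups whose sizes differ by at most one.
    base, rem = divmod(len(items), c)
    groups, start = [], 0
    for i in range(c):
        size = base + (1 if i < rem else 0)
        if size == 0:
            break
        groups.append(items[start:start + size])
        start += size
    return groups

def build_balanced_tree(items, c):
    # Recursively apply the balanced split until every leaf holds one item.
    if len(items) == 1:
        return items[0]
    return [build_balanced_tree(group, c) for group in split_balanced(items, c)]
```

For eight items and c = 2, this yields a depth-3 tree whose leaves are the individual items.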

    An Instance of Balanced Hierarchical Clustering Tree

    The second step of TPGR is to construct a policy network for each non-leaf node of the balanced hierarchical clustering tree built in the first step. We assume there is a status point indicating which node is currently located; picking an item is thus to move the status point from the root to a certain leaf. Each non-leaf node of the tree is associated with a policy network whose output layer uses a softmax activation. For a node v where the status point is located, the policy network associated with v takes the current state as input and outputs a probability distribution over the child nodes of v, which indicates the probability of moving to each child. Using a recommendation scenario with eight items for illustration, the constructed balanced clustering tree with tree depth 3 is shown in the figure above. For a given state s, the status point is initially located at the root (node 1) and moves to one of its child nodes (node 3) according to the probability distribution given by the policy network corresponding to the root. The status point keeps moving until it reaches a leaf node, and the corresponding item (item 8 in the figure above) is recommended to the user.
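The traversal just described can be sketched as below. The stand-in `policy(state, node)` function plays the role of the per-node policy networks; the function names and the nested-list tree encoding are our illustration, not the paper's implementation.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a node's child logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_path(tree, state, policy, rng):
    """Move the status point from the root to a leaf.

    `policy(state, node)` stands in for the policy network of the current
    non-leaf node and returns one logit per child node.
    Returns the recommended item and the sequence of child indices chosen.
    """
    node, path = tree, []
    while isinstance(node, list):  # non-leaf: choose a child stochastically
        probs = softmax(policy(state, node))
        i = rng.choices(range(len(node)), weights=probs)[0]
        path.append(i)
        node = node[i]
    return node, path
```

With a uniform stand-in policy and the eight-item depth-3 tree, each call samples one of the eight items by making three child choices, rather than scoring all eight items at once.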

    Architecture of Tree-structured Policy Gradient Recommendation

    To evaluate TPGR on publicly available offline datasets (MovieLens and Netflix), we construct an environment simulator that mimics the online environment with principles derived from real-world data. The baselines are Popularity, GreedySVD, MAB methods (LinearUCB, HLinearUCB) and RL methods (DDPG-KNN, DDPG-R, DQN-R). The evaluation metrics include the traditional precision, recall and F1 as well as the online learning reward. We find that TPGR outperforms all baselines in terms of these metrics on both datasets.

    We also compare the efficiency of TPGR with the baselines in terms of training and decision-making time, and find that TPGR is much more efficient.
    [1] Xiangyu Zhao et al. Recommendation with Negative Feedback via Pairwise Deep Reinforcement Learning. KDD 2018.
    [2] Guanjie Zheng et al. A Deep Reinforcement Learning Framework for News Recommendation. WWW 2018.
    [3] Xiangyu Zhao et al. Deep Reinforcement Learning for Page-wise Recommendation. RecSys 2018.
    [4] Yujing Hu et al. Reinforcement Learning to Rank in E-commerce Search Engine: Formalization, Analysis, and Application. KDD 2018.