Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

Guo D; ANGELA YU-CHEN LIN; Yu A.J.

Journal

Advances in Neural Information Processing Systems

Journal Volume

2018-December

Pages

5176-5185

Date Issued

2018

Author(s)

Guo D

ANGELA YU-CHEN LIN

URI

https://www.scopus.com/inward/record.uri?eid=2-s2.0-85064845543&partnerID=40&md5=13815c87279e63672896b02589b01171

https://scholars.lib.ntu.edu.tw/handle/123456789/625615

Abstract

How humans make repeated choices among options with imperfectly known reward outcomes is an important problem in psychology and neuroscience. It is often studied using the multi-armed bandit task, which is also frequently studied in machine learning. We present data from a human stationary bandit experiment, in which we vary the average abundance and variability of reward availability (the mean and variance of the reward rate distribution). Surprisingly, we find that subjects significantly underestimate the prior mean of reward rates, as measured by their self-reported reward expectations for non-chosen arms at the end of a game. Previously, human learning in the bandit task was found to be well captured by a Bayesian ideal learning model, the Dynamic Belief Model (DBM), albeit under an incorrect generative assumption about the temporal structure: humans assume that reward rates can change over time even though they are truly fixed. We find that the “pessimism bias” in the bandit task is well captured by the prior mean of DBM when fitted to human choices, but it is poorly captured by the prior mean of the Fixed Belief Model (FBM), an alternative Bayesian model that (correctly) assumes reward rates to be constant. This pessimism bias is also incompletely captured, in terms of fitted initial Q-values, by a simple reinforcement learning (RL) model commonly used in neuroscience and psychology. While it seems sub-optimal, and thus mysterious, that humans have an underestimated prior reward expectation, our simulations show that an underestimated prior mean helps to maximize long-term gain if the observer assumes volatility when reward rates are stable and uses a softmax decision policy instead of the optimal one (obtainable by dynamic programming). This raises the intriguing possibility that the brain underestimates reward rates to compensate for the incorrect non-stationarity assumption in the generative model and for a simplified decision policy.
© 2018 Curran Associates Inc. All rights reserved.
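
The belief models named in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes binary rewards, a Beta-shaped prior over reward rates discretized on a grid, and the usual DBM formulation in which, before each trial, the reward rate is kept with probability gamma and otherwise redrawn from the prior; gamma = 1 then reduces to the FBM. The names GRID, beta_prior, dbm_predict, dbm_update, softmax and play, and all parameter values, are hypothetical choices made for illustration.

```python
import numpy as np

# Assumed discretization of the reward rate; the abstract does not specify one.
GRID = np.linspace(0.005, 0.995, 100)

def beta_prior(a, b):
    """Beta(a, b)-shaped prior over GRID, normalized to sum to 1."""
    w = GRID ** (a - 1) * (1 - GRID) ** (b - 1)
    return w / w.sum()

def dbm_predict(belief, prior, gamma):
    """Predictive step: keep the rate with prob. gamma, else redraw from the prior.
    gamma = 1.0 gives the Fixed Belief Model (no assumed change)."""
    return gamma * belief + (1 - gamma) * prior

def dbm_update(belief, reward):
    """Bayes update of one arm's belief after a binary reward."""
    likelihood = GRID if reward else 1 - GRID
    post = likelihood * belief
    return post / post.sum()

def softmax(values, b):
    """Softmax choice probabilities with inverse temperature b (assumed form)."""
    z = b * (values - values.max())  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def play(true_rates, prior, gamma, b, n_trials, rng):
    """Simulate one game with stationary true_rates; return total reward
    and the final expected reward rate of each arm."""
    n_arms = len(true_rates)
    beliefs = np.tile(prior, (n_arms, 1))
    total = 0
    for _ in range(n_trials):
        beliefs = dbm_predict(beliefs, prior, gamma)  # applies to every arm
        means = beliefs @ GRID
        arm = rng.choice(n_arms, p=softmax(means, b))
        reward = rng.random() < true_rates[arm]
        beliefs[arm] = dbm_update(beliefs[arm], reward)  # only the chosen arm
        total += int(reward)
    return total, beliefs @ GRID
```

Under these assumptions, the simulation described in the abstract would amount to calling play many times with priors of different means (for example beta_prior(2, 4) versus beta_prior(4, 2)) and comparing average total reward for gamma below 1 against gamma equal to 1.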

Other Subjects

Bayesian networks; Machine learning; Neurology; Reinforcement learning; Decision policy; Generative model; Learning models; Multi armed bandit; Non-stationarities; Rate distributions; Reinforcement learning models; Temporal structures; Dynamic programming

Type

conference paper
