Homework 1
ELEN E6885: Introduction to Reinforcement Learning
September 18, 2024
Problem 1 (2-Armed Bandit, 20 Points)
Consider the following 2-armed bandit problem: the first arm has a fixed reward of 0.3, and the second arm has a 0-1 reward following a Bernoulli distribution with probability 0.6, i.e., arm 2 yields reward 1 with probability 0.6. Assume we selected arm 1 at t = 1, and arm 2 four times at t = 2, 3, 4, 5 with rewards 0, 1, 0, 0, respectively. We use the sample-average technique to estimate the action values, and then use them to guide our choices starting from t = 6.
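As a minimal sketch of the sample-average technique, the estimate for each arm is the mean of the rewards observed from that arm so far. The function name `sample_average` and the dictionary layout are illustrative assumptions, not part of the assignment, and so is the default of 0.0 for an arm that has never been played.

```python
def sample_average(rewards_by_arm):
    """Return the mean observed reward per arm.

    Assumption: an arm with no observations gets an estimate of 0.0.
    """
    return {
        arm: (sum(rewards) / len(rewards) if rewards else 0.0)
        for arm, rewards in rewards_by_arm.items()
    }


# History from the problem statement:
# arm 1 played once at t = 1 (fixed reward 0.3),
# arm 2 played at t = 2, 3, 4, 5 with rewards 0, 1, 0, 0.
history = {1: [0.3], 2: [0, 1, 0, 0]}

estimates = sample_average(history)
print(estimates)  # {1: 0.3, 2: 0.25}
```

These are the action-value estimates available at the start of t = 6. Each later play appends one reward to the chosen arm's list, and the estimates are then recomputed.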
1. [5 pts] Which arm will be played at t = 6, 7, respectively, if the greedy met