This project treats a book recommendation scenario as a multiple-play multi-armed bandit problem:
- There are N books (arms)
- At each round we display 6 books to a random user
- Reward = number of displayed books the user actually purchases (0–6)
- Goal: maximize cumulative (or average) purchases over many interactions
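The setup above can be sketched as a small simulated environment. This is an illustrative sketch only: the number of books, the purchase probabilities, and all names (`step`, `true_probs`, etc.) are assumptions for demonstration, not the project's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

N_BOOKS = 20     # assumed number of arms (books)
SLATE_SIZE = 6   # books displayed per round, as in the problem statement

# Hidden per-book purchase probabilities (unknown to the algorithm).
true_probs = rng.uniform(0.05, 0.35, size=N_BOOKS)

def step(displayed):
    """Simulate one user: return a 0/1 purchase outcome per displayed book."""
    return (rng.random(len(displayed)) < true_probs[displayed]).astype(int)

# One round: display 6 books, observe per-book purchases.
purchases = step(np.arange(SLATE_SIZE))
reward = purchases.sum()  # round reward, an integer in 0..6
```

Returning the full `purchases` vector (rather than only its sum) is what makes per-book credit assignment possible later.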
We implemented a UCB1-style algorithm adapted to select the top 6 books simultaneously, using per-book reward feedback. Key implementation details:
- Modified the environment to return per-book purchase information (essential for correct credit assignment)
- Fixed numerical stability issues in UCB computation (safe handling of unexplored books)
- Used standard UCB1 confidence term with optional tuning of the exploration constant
- Ran both short (10k steps) and long (100k steps) simulations
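The steps above can be sketched end to end. Assumptions are flagged in comments: the arm count, purchase probabilities, and exploration constant `C` are illustrative, and the code shows one standard way to combine UCB1 with top-6 selection and safe handling of unexplored arms, not the project's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N_BOOKS, SLATE_SIZE = 20, 6   # assumed arm count; slate size from the problem
C = 1.0                       # tunable exploration constant (assumption)

# Hidden per-book purchase probabilities (unknown to the algorithm).
true_probs = rng.uniform(0.05, 0.35, size=N_BOOKS)

counts = np.zeros(N_BOOKS)    # times each book has been displayed
values = np.zeros(N_BOOKS)    # empirical purchase rate per book

T = 10_000                    # the "short" simulation length
total = 0
for t in range(1, T + 1):
    # Safe UCB computation: books never displayed get an infinite score,
    # so each is tried at least once and there is no divide-by-zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = C * np.sqrt(2 * np.log(t) / counts)
    ucb = np.where(counts == 0, np.inf, values + bonus)

    # Multiple play: display the 6 books with the highest UCB scores.
    slate = np.argsort(ucb)[-SLATE_SIZE:]

    # Per-book feedback: a 0/1 purchase outcome for each displayed book.
    purchases = (rng.random(SLATE_SIZE) < true_probs[slate]).astype(int)
    total += purchases.sum()

    # Credit each displayed book individually (incremental mean update).
    counts[slate] += 1
    values[slate] += (purchases - values[slate]) / counts[slate]

avg_reward = total / T  # average purchases per round, between 0 and 6
```

Updating each displayed book with its own purchase outcome, rather than spreading the slate's total reward evenly, is the credit-assignment step that the environment modification enables.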