Potential-based reward shaping for finite horizon online POMDP planning

Adam Eck, Leen-Kiat Soh, Sam Devlin, Daniel Kudenko

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

In this paper, we address the problem of suboptimal behavior during online partially observable Markov decision process (POMDP) planning caused by time constraints on planning. Taking inspiration from the related field of reinforcement learning (RL), our solution is to shape the agent’s reward function in order to lead the agent to large future rewards without having to spend as much time explicitly estimating cumulative future rewards, enabling the agent to save time to improve the breadth of planning and build higher quality plans. Specifically, we extend potential-based reward shaping (PBRS) from RL to online POMDP planning. In our extension, information about belief states is added to the function optimized by the agent during planning. This information provides hints of where the agent might find high future rewards beyond its planning horizon, and thus achieve greater cumulative rewards. We develop novel potential functions measuring information useful to agent metareasoning in POMDPs (reflecting on agent knowledge and/or histories of experience with the environment), theoretically prove several important properties and benefits of using PBRS for online POMDP planning, and empirically demonstrate these results in a range of classic benchmark POMDP planning problems.
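
To make the idea concrete, here is a minimal sketch of the mechanism the abstract describes: a potential function defined over belief states whose discounted difference, F(b, b') = γΦ(b') − Φ(b), is added to the immediate reward used during planning. The negative-entropy potential below is only one plausible information-based choice, and the function names, discount factor, and example beliefs are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def belief_entropy(belief):
        # Shannon entropy of a discrete belief state (a probability vector).
        p = belief[belief > 0.0]
        return float(-np.sum(p * np.log(p)))

    def potential(belief):
        # Hypothetical information-based potential: negative belief entropy,
        # so less uncertain beliefs receive higher potential.
        return -belief_entropy(belief)

    def shaped_reward(reward, belief, next_belief, gamma=0.95):
        # Potential-based shaping: add F(b, b') = gamma * Phi(b') - Phi(b)
        # to the immediate reward used while expanding the plan tree.
        return reward + gamma * potential(next_belief) - potential(belief)

    # Example: an action that sharpens a uniform belief over four states
    # earns a shaping bonus for the uncertainty it removes.
    b = np.array([0.25, 0.25, 0.25, 0.25])
    b_next = np.array([0.7, 0.1, 0.1, 0.1])
    print(shaped_reward(1.0, b, b_next))  # > 1.0: bonus for reduced uncertainty

The difference-of-potentials form is what gives PBRS its policy-invariance guarantee in the classic MDP setting (Ng et al., 1999); the theoretical properties the abstract refers to concern this same construction in the finite-horizon online POMDP planning setting.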

Original language: English (US)
Pages (from-to): 403-445
Number of pages: 43
Journal: Autonomous Agents and Multi-Agent Systems
Volume: 30
Issue number: 3
DOI: 10.1007/s10458-015-9292-6
State: Published - May 1, 2016

Keywords

  • Online planning
  • POMDP
  • Potential-based reward shaping

ASJC Scopus subject areas

  • Artificial Intelligence

Cite this

Eck, A., Soh, L.-K., Devlin, S., & Kudenko, D. (2016). Potential-based reward shaping for finite horizon online POMDP planning. Autonomous Agents and Multi-Agent Systems, 30(3), 403-445. https://doi.org/10.1007/s10458-015-9292-6

@article{4f2de56e57ff4266bf89b2dea54e79b7,
  title     = "Potential-based reward shaping for finite horizon online POMDP planning",
  author    = "Adam Eck and Leen-Kiat Soh and Sam Devlin and Daniel Kudenko",
  journal   = "Autonomous Agents and Multi-Agent Systems",
  issn      = "1387-2532",
  publisher = "Springer Netherlands",
  volume    = "30",
  number    = "3",
  pages     = "403--445",
  year      = "2016",
  month     = may,
  doi       = "10.1007/s10458-015-9292-6",
  keywords  = "Online planning, POMDP, Potential-based reward shaping",
  language  = "English (US)",
}
