# Self-Optimizing and Pareto-Optimal Policies in General Environments based on Bayes-Mixtures

Authors
Type
Published Article
Publication Date
Apr 17, 2002
Submission Date
Apr 17, 2002
Identifiers
arXiv ID: cs/0204040
Source
arXiv
The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle $t$ action $y_t$ results in perception $x_t$ and reward $r_t$, where all quantities in general may depend on the complete history. The perception $x_t$ and reward $r_t$ are sampled from the (reactive) environmental probability distribution $\mu$. This very general setting includes, but is not limited to, (partial observable, k-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if $\mu$ is known. Reinforcement learning is usually used if $\mu$ is unknown. In the Bayesian approach one defines a mixture distribution $\xi$ as a weighted sum of distributions $\nu\in\M$, where $\M$ is any class of distributions including the true environment $\mu$. We show that the Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing in the sense that the average value converges asymptotically for all $\mu\in\M$ to the optimal value achieved by the (infeasible) Bayes-optimal policy $p^\mu$ which knows $\mu$ in advance. We show that the necessary condition that $\M$ admits self-optimizing policies at all, is also sufficient. No other structural assumptions are made on $\M$. As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that $p^\xi$ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in {\em all} environments $\nu\in\M$ and a strictly higher value in at least one.