Multistate Temporal Difference Target for Model-Free Reinforcement Learning
Univ Newcastle, Australia.
Univ Newcastle, Australia.
Linköping University, Department of Computer and Information Science. Linköping University, Faculty of Science & Engineering.
2025 (English). In: IEEE Transactions on Neural Networks and Learning Systems, ISSN 2162-237X, E-ISSN 2162-2388, Vol. 36, no 9, p. 16854-16863. Article in journal (Refereed), Published
Abstract [en]

Temporal difference (TD) learning is a fundamental technique in reinforcement learning that updates value function estimates for states or state-action pairs using a TD target. This target represents an improved estimate of the true value by incorporating both immediate rewards and the estimated value of subsequent states. We propose an enhanced multistate TD (MSTD) target that utilizes multiple subsequent states for a more accurate value function estimation compared to traditional TD learning, which relies on a single subsequent state. Building on this new MSTD concept, we develop actor-critic algorithms that include the management of replay buffers in two modes and integrate with deep deterministic policy gradient (DDPG) and soft actor-critic (SAC). Numerical experiment results demonstrate that algorithms employing the MSTD target improve learning performance compared to traditional methods. In addition, we analyze the convergence of Q-learning with MSTD.
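
For orientation, the conventional single-state TD target that the abstract contrasts with can be sketched as below. This is a minimal illustration of standard one-step TD learning only, not the paper's MSTD target, whose exact form is not stated in this record. The function names `td_target` and `td_update` and the default values of `gamma` and `alpha` are assumptions chosen for the example.

```python
def td_target(reward, next_value, gamma=0.99, done=False):
    """Standard one-step TD target: r + gamma * V(s').

    Illustrative only; names and defaults are assumptions, not from the paper.
    At the end of an episode there is no successor state to bootstrap from.
    """
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap


def td_update(value, target, alpha=0.1):
    """Move the current estimate a fraction alpha toward the TD target."""
    return value + alpha * (target - value)


if __name__ == "__main__":
    # Hypothetical numbers: reward 1.0, successor state currently valued at 2.0.
    target = td_target(reward=1.0, next_value=2.0, gamma=0.5)
    new_value = td_update(value=0.0, target=target, alpha=0.5)
    print(target, new_value)
```

The target here depends on the value of a single subsequent state, which is the limitation the MSTD target is described as addressing by drawing on multiple subsequent states.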

Place, publisher, year, edition, pages
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2025. Vol. 36, no 9, p. 16854-16863
Keywords [en]
Reinforcement learning; Estimation; Training; Convergence; Trajectory; Accuracy; Temporal difference learning; Optimization; Monte Carlo methods; Indexes; Actor-critic learning; Q value; reinforcement learning; state-action value; temporal difference (TD)
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:liu:diva-213694
DOI: 10.1109/TNNLS.2025.3564078
ISI: 001484784300001
PubMedID: 40343824
Scopus ID: 2-s2.0-105004943838
OAI: oai:DiVA.org:liu-213694
DiVA, id: diva2:1959573
Available from: 2025-05-21. Created: 2025-05-21. Last updated: 2026-04-07. Bibliographically approved.
By author/editor
Zhang, Lepeng