Multistate Temporal Difference Target for Model-Free Reinforcement Learning
Univ Newcastle, Australia.
Univ Newcastle, Australia.
Linköpings universitet, Institutionen för datavetenskap. Linköpings universitet, Tekniska fakulteten.
2025 (English). In: IEEE Transactions on Neural Networks and Learning Systems, ISSN 2162-237X, E-ISSN 2162-2388, Vol. 36, no. 9, pp. 16854-16863. Journal article (peer-reviewed). Published
Abstract [en]

Temporal difference (TD) learning is a fundamental technique in reinforcement learning that updates value function estimates for states or state-action pairs using a TD target. This target is an improved estimate of the true value that combines the immediate reward with the estimated value of the subsequent state. We propose an enhanced multistate TD (MSTD) target that uses multiple subsequent states to estimate the value function more accurately than traditional TD learning, which relies on a single subsequent state. Building on the MSTD concept, we develop actor-critic algorithms that manage replay buffers in two modes and integrate with deep deterministic policy gradient (DDPG) and soft actor-critic (SAC). Numerical experiments show that algorithms employing the MSTD target improve learning performance compared with traditional methods. In addition, we analyze the convergence of Q-learning with MSTD.
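
The contrast between a single-state and a multistate target can be sketched in a few lines of Python. This record does not give the paper's MSTD formula, so the sketch below rests on an assumption: it averages the k-step bootstrapped targets built from each of the n subsequent states. The function names `one_step_td_target` and `multistate_td_target` and the equal weighting are hypothetical illustrations, not the authors' definition.

```python
from typing import Sequence


def one_step_td_target(reward: float, next_value: float, gamma: float) -> float:
    """Classic TD target: immediate reward plus discounted value of the next state."""
    return reward + gamma * next_value


def multistate_td_target(
    rewards: Sequence[float],
    next_values: Sequence[float],
    gamma: float,
) -> float:
    """Illustrative multistate target (ASSUMPTION: equal-weight average).

    rewards[i]     is the reward received i steps after the current state.
    next_values[i] is the estimated value of the state reached after i + 1 steps.

    For each k = 1..n a k-step target is formed:
        G_k = sum_{i<k} gamma^i * rewards[i] + gamma^k * next_values[k-1]
    and the n targets are averaged. The paper may weight or combine
    them differently.
    """
    if not rewards or len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must be non-empty and equal length")

    discounted_return = 0.0  # running sum of gamma^i * rewards[i]
    discount = 1.0           # gamma^i before the step, gamma^(i+1) after it
    targets = []
    for reward, value in zip(rewards, next_values):
        discounted_return += discount * reward
        discount *= gamma
        targets.append(discounted_return + discount * value)
    return sum(targets) / len(targets)


if __name__ == "__main__":
    # With a single subsequent state the sketch reduces to the classic TD target.
    print(one_step_td_target(1.0, 10.0, 0.5))
    print(multistate_td_target([1.0], [10.0], 0.5))
    # With two subsequent states, the 1-step and 2-step targets are averaged.
    print(multistate_td_target([1.0, 1.0], [10.0, 10.0], 0.5))
```

With one subsequent state the sketch coincides with the traditional TD target, which makes the single-state case a useful sanity check for any multistate variant.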

Place, publisher, year, edition, pages
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2025. Vol. 36, no. 9, pp. 16854-16863
Keywords [en]
Reinforcement learning; Estimation; Training; Convergence; Trajectory; Accuracy; Temporal difference learning; Optimization; Monte Carlo methods; Indexes; Actor-critic learning; Q value; reinforcement learning; state-action value; temporal difference (TD)
Identifiers
URN: urn:nbn:se:liu:diva-213694; DOI: 10.1109/TNNLS.2025.3564078; ISI: 001484784300001; PubMedID: 40343824; Scopus ID: 2-s2.0-105004943838; OAI: oai:DiVA.org:liu-213694; DiVA, id: diva2:1959573
Available from: 2025-05-21. Created: 2025-05-21. Last updated: 2026-04-07. Bibliographically approved.

By author/editor
Zhang, Lepeng