Scheduling and Aggregation Design for Asynchronous Federated Learning Over Wireless Networks
Hu, Chung-Hsuan. Linköping University, Department of Electrical Engineering, Communication Systems. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-9547-5580
Chen, Zheng. Linköping University, Department of Electrical Engineering, Communication Systems. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0001-5621-2860
Larsson, Erik G. Linköping University, Department of Electrical Engineering, Communication Systems. Linköping University, Faculty of Science & Engineering. ORCID iD: 0000-0002-7599-4367
2023 (English). In: IEEE Journal on Selected Areas in Communications, ISSN 0733-8716, E-ISSN 1558-0008, Vol. 41, no. 4, p. 874-886. Article in journal (Refereed). Published
Abstract [en]

Federated Learning (FL) is a collaborative machine learning (ML) framework that combines on-device training and server-based aggregation to train a common ML model among distributed agents. In this work, we propose an asynchronous FL design with periodic aggregation to tackle the straggler issue in FL systems. Considering limited wireless communication resources, we investigate the effect of different scheduling policies and aggregation designs on the convergence performance. Driven by the importance of reducing the bias and variance of the aggregated model updates, we propose a scheduling policy that jointly considers the channel quality and training data representation of user devices. The effectiveness of our channel-aware data-importance-based scheduling policy, compared with state-of-the-art methods proposed for synchronous FL, is validated through simulations. Moreover, we show that an "age-aware" aggregation weighting design can significantly improve the learning performance in an asynchronous FL setting.
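
To make the aggregation idea concrete, the following is a minimal sketch of age-aware weighted aggregation. It is not the paper's exact rule: the exponential-decay weight, the function names, and the toy dimensions are all illustrative assumptions.

```python
import numpy as np

def age_weight(staleness, decay=0.5):
    # Illustrative age-aware weight: staler updates count less.
    # The exponential form is an assumption, not the paper's design.
    return np.exp(-decay * staleness)

def aggregate(global_model, updates, staleness, lr=1.0):
    # Combine asynchronously received local updates, discounting by age.
    # updates:   list of local model deltas (np.ndarray), one per device
    # staleness: rounds elapsed since each device last synced its model
    w = np.array([age_weight(s) for s in staleness])
    w = w / w.sum()  # normalize into a convex combination
    delta = sum(wi * ui for wi, ui in zip(w, updates))
    return global_model + lr * delta

# Toy usage: three devices; the stalest update gets the smallest weight.
rng = np.random.default_rng(0)
model = np.zeros(4)
updates = [rng.normal(size=4) for _ in range(3)]
model = aggregate(model, updates, staleness=[0, 1, 3])
print(model)
```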

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023. Vol. 41, no. 4, p. 874-886
Keywords [en]
Training; Servers; Data models; Computational modeling; Training data; Convergence; Load modeling; Federated Learning; asynchronous training; wireless networks; scheduling; aggregation
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:liu:diva-193423
DOI: 10.1109/JSAC.2023.3242719
ISI: 000966098400001
Scopus ID: 2-s2.0-85149368263
OAI: oai:DiVA.org:liu-193423
DiVA, id: diva2:1754629
Note

Funding agencies: Zenith; Excellence Center at Linköping-Lund in Information Technology (ELLIIT); Knut and Alice Wallenberg Foundation

Available from: 2023-05-04 Created: 2023-05-04 Last updated: 2026-03-04
In thesis
1. Communication-Efficient Resource Allocation for Wireless Federated Learning Systems
2023 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

The training of machine learning (ML) models usually requires a massive amount of data. Nowadays, the ever-increasing number of connected user devices has benefited the development of ML algorithms by providing large sets of data that can be utilized for model training. As privacy concerns become increasingly important in our society, using private data from user devices to train ML models becomes problematic. Therefore, federated learning (FL) with on-device information processing has been proposed for its advantages in preserving data privacy. FL is a collaborative ML framework where multiple devices participate in training a common global model based on locally available data. Unlike the centralized ML architecture, wherein the entire set of training data needs to be centrally stored, in an FL system only model parameters are shared between user devices and a parameter server.

Federated Averaging (FedAvg) is one of the most representative baseline FL algorithms, with an iterative process of model broadcasting, local training, and model aggregation. In every iteration, the model aggregation process can start only when all the devices have finished local training. Thus, the duration of one iteration is limited by the slowest device, which is known as the straggler issue. To resolve this commonly observed issue in synchronous FL methods, altering the synchronous procedure to an asynchronous one has been explored in the literature; that is, the server does not need to wait for all the devices to finish local training before aggregating updates. However, to avoid the high communication costs and implementation complexity that existing asynchronous FL methods incur, we instead propose a new asynchronous FL framework with periodic aggregation. Since the FL process involves information exchange over a wireless medium, allowing only partial participation of devices in transmitting model updates is a common approach to avoiding the communication bottleneck. We thus further develop channel-aware data-importance-based scheduling policies, which are theoretically motivated by the convergence analysis of the proposed FL system. In addition, an age-aware aggregation weighting design is proposed to deal with the model update asynchrony among scheduled devices in the considered asynchronous FL system. The proposed scheme is empirically shown to alleviate the straggler effect and achieve better learning outcomes than some state-of-the-art methods.
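
As an illustration of what a channel-aware data-importance-based scheduling policy can look like, here is a minimal sketch. The product-form score combining a channel-quality proxy with a gradient-norm proxy for data importance is a hypothetical stand-in; the thesis derives its actual metric from the convergence analysis.

```python
import numpy as np

def schedule(channel_gain, grad_norm, k):
    # Jointly score devices by channel quality and data importance,
    # then pick the k highest-scoring ones. The product form is an
    # illustrative assumption, not the thesis' derived metric.
    score = np.asarray(channel_gain) * np.asarray(grad_norm)
    return np.argsort(score)[-k:][::-1]  # indices of top-k scores

# Toy usage: 6 devices, schedule 3 of them this round.
rng = np.random.default_rng(1)
gains = rng.rayleigh(size=6)      # stand-in for instantaneous channel quality
importance = rng.uniform(size=6)  # stand-in for local data importance
print(schedule(gains, importance, k=3))
```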

From the perspective of jointly optimizing system efficiency and learning performance, in the rest of the thesis we consider a scenario of Federated Edge Learning (FEEL) where, in addition to the heterogeneity of data and wireless channels, heterogeneous computation capability and energy availability are also taken into account in the scheduling design. Moreover, instead of assuming that all the local data are available at the beginning of the training process, a more practical scenario where the training data might be generated randomly over time is considered. Hence, considering time-varying local training data, wireless link conditions, and computing capability, we formulate a stochastic network optimization problem and propose a dynamic scheduling algorithm for optimizing the learning performance subject to a per-round latency requirement and long-term energy constraints. The effectiveness of the proposed design is validated by numerical simulations, showing gains in learning performance and system efficiency compared to alternative methods.

Abstract [sv]

Training machine learning (ML) models usually requires enormous amounts of data. Nowadays, the ever-increasing number of connected devices has benefited the development of ML algorithms by providing large data sets that can be used to train the models. However, as privacy issues become more important in society, using private data from users' devices to train ML models becomes more difficult. Therefore, federated learning (FL), where the information processing takes place on the device, has been proposed for its advantages in keeping user data private. FL is a collaborative ML technique in which multiple devices participate in training a common global model based solely on locally available data. Unlike a centralized ML architecture, where all training data must be stored on a central server, an FL system only needs to share model parameters between the users' devices and a central parameter server.

Federated Averaging (FedAvg) is one of the most representative and fundamental FL algorithms, with an iterative process comprising model broadcasting, local training, and model aggregation. In each iteration, the model aggregation process can start only after all individual devices have finished their local training. Consequently, the duration of each iteration is strictly limited by the slowest device, which is commonly known as the straggler issue. To solve this problem, the transition from a synchronous procedure to an asynchronous one has been explored in the literature. In the latter case, the server does not need to wait for all devices to finish their local training before performing the aggregation. To avoid the high communication costs and implementation complexity that existing asynchronous FL methods entail, we instead propose a new asynchronous FL framework with periodic aggregation. Since the FL process involves information exchange over resource-constrained wireless media, it is common to allow only a subset of the participating devices to take part in the model updates, in order to avoid bottlenecks in the wireless communication. We therefore further develop device scheduling policies for achieving better learning performance, by assessing the importance of the local data and the precision of the exchanged information. In addition, the aggregation weights are designed to mitigate the negative effect on learning performance that arises from the asynchronous information in the FL system. It is shown empirically that the proposed scheme effectively alleviates the straggler problem and achieves better learning results than some other state-of-the-art methods.

From the perspective of jointly optimizing system efficiency and learning performance, in the remainder of the thesis we consider a federated edge learning scenario in which heterogeneous computational capability and energy availability are also taken into account in the scheduling design. Instead of assuming that all local data are available at the beginning of the training process, a more practical scenario is considered in which the training data are generated randomly over time. By taking into account the time-varying local training data, the quality of the wireless link, and the available computational capacity, we formulate a stochastic network optimization problem and propose a dynamic scheduling algorithm that optimizes the learning performance subject to constraints on latency and energy consumption. The effectiveness of the proposed design is confirmed by numerical simulations, which show gains in learning performance and system efficiency compared with alternative methods.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2023. p. 30
Series
Linköping Studies in Science and Technology. Licentiate Thesis, ISSN 0280-7971 ; 1969
National Category
Computer Sciences
Identifiers
urn:nbn:se:liu:diva-193736 (URN)
10.3384/9789180752329 (DOI)
9789180752312 (ISBN)
9789180752329 (ISBN)
Presentation
2023-06-16, Systemet, B-building, Campus Valla, Linköping, 10:15 (English)
Available from: 2023-05-16 Created: 2023-05-16 Last updated: 2023-05-24. Bibliographically approved
2. Wireless Federated Learning: Efficient Communication and Resource Management
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The training of machine learning (ML) models usually requires an extensive amount of data. Nowadays, the ever-increasing number of connected user devices has benefited the development of ML algorithms by providing large sets of data that can be utilized for model training. As society has become more aware of user privacy, using private data from user devices for training ML models has become more restricted. Therefore, federated learning (FL) with on-device information processing has been proposed for its advantages in preserving data privacy. FL is a collaborative ML framework where multiple devices participate in training a common global model based on locally available data. Unlike the centralized ML architecture, wherein the entire set of training data needs to be centrally stored, in an FL system only model parameters (or model updates) are shared between user devices and a parameter server.

We focus on FL deployed at the wireless edge; namely, information exchanges in FL take place over wireless networks. There are two major challenges for the considered FL setup: stringent communication resource constraints, and non-independent-and-identically-distributed (non-i.i.d.) data. Since the information exchange between different entities takes place over wireless networks, the resource limitation therein affects how many devices can participate in the training and how much information can be reliably communicated in each round of the FL process. One common approach to reducing the communication load is partial device participation, allowing only a subset of devices to transmit model updates in every round. The design of scheduling policies is critical for ensuring efficient collaborative training under limited communication resources. On the other hand, data heterogeneity introduces objective bias into the course of model evolution. This differentiates FL from standard distributed ML frameworks, where training data are homogeneously distributed. Federated Averaging (FedAvg) is one of the most representative baseline FL algorithms, with an iterative process of model broadcasting, on-device training, and model aggregation. One well-known issue of FedAvg is the straggler effect. Specifically, in every iteration, the model aggregation process takes place only when all the devices have finished local training. As a result, the duration of one iteration is limited by the slowest device.
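
A toy calculation makes the straggler effect concrete: in synchronous FedAvg the round time is the maximum of the per-device training times, so one slow device dictates the pace, while a fixed aggregation deadline caps the round time at the cost of partial participation. All numbers and the uniform timing model below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_devices, n_rounds = 10, 100

# Hypothetical per-round local training times (seconds); device 0 is slow.
times = rng.uniform(1.0, 2.0, size=(n_rounds, n_devices))
times[:, 0] *= 10  # straggler

# Synchronous FedAvg: every round waits for the slowest device.
sync_round_time = times.max(axis=1)

# Periodic aggregation: the server aggregates whatever updates arrived
# by a fixed deadline instead of waiting for all devices.
deadline = 3.0
participants = (times <= deadline).sum(axis=1)

print(f"mean synchronous round time: {sync_round_time.mean():.2f} s")
print(f"mean devices meeting the {deadline:.0f} s deadline: "
      f"{participants.mean():.1f}/{n_devices}")
```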

To resolve the straggler issue commonly observed in synchronous FL methods, a paradigm shift to asynchronous FL has been explored in the literature; that is, local devices perform local training and updating in an asynchronous manner without system-wide model synchronization. However, the existing asynchronous FL methods incur frequent information exchange and large model variation. To address these issues, in Paper A we propose a new asynchronous FL framework with periodic aggregation and develop channel-aware data-importance-based device scheduling policies, which are theoretically motivated by the convergence analysis of the proposed FL design. In addition, an age-aware aggregation weight design is proposed to deal with the model update asynchrony among scheduled devices. The proposed scheme is empirically verified to alleviate the straggler effect and achieve better learning performance than state-of-the-art methods.

In Paper B, we consider Federated Edge Learning (FEEL), where, in addition to the heterogeneity in data and wireless channel conditions, heterogeneous computing capability and energy availability are additional factors taken into account in the algorithm design. Under these premises, we aim to develop a dynamic scheduling and resource management design by jointly optimizing system efficiency and learning performance. Moreover, instead of assuming that all the local data are available at the beginning of the entire training process, a more practical scenario where the training data are generated randomly over time is considered. To develop a dynamic scheduling and resource allocation algorithm, we formulate a stochastic network optimization problem with a long-term objective and constraints and solve it using the Lyapunov drift-plus-penalty framework. The proposed algorithm makes adaptive decisions on device scheduling, computational capacity adjustment, and allocation of bandwidth and transmit power in every iteration. We provide a convergence analysis for the considered setting with heterogeneous data and time-varying objective functions. The effectiveness of our scheme is verified through simulation results, demonstrating improved learning performance and energy efficiency compared to baseline schemes.
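
The drift-plus-penalty mechanics can be sketched generically: a virtual queue per long-term energy constraint grows when the budget is exceeded, and each round's decision minimizes a weighted sum of a learning-loss penalty and the queue-weighted energy cost. This is a textbook-style sketch under assumed names (the weight `V`, per-device budgets, enumerated candidate schedules), not Paper B's actual algorithm.

```python
import numpy as np

def drift_plus_penalty_step(Q, options, V=10.0):
    # Choose the action minimizing V * (loss proxy) + Q . energy.
    # Q:       virtual queues tracking long-term energy-budget violations
    # options: candidate actions, each with per-device 'energy' and a
    #          scalar learning 'loss' proxy (both hypothetical here)
    costs = [V * a["loss"] + float(Q @ a["energy"]) for a in options]
    return options[int(np.argmin(costs))]

def update_queues(Q, energy, budget):
    # Standard virtual-queue update: Q <- max(Q + energy - budget, 0).
    return np.maximum(Q + energy - budget, 0.0)

# Toy usage: 4 devices, 2 candidate schedules per round, 3 rounds.
rng = np.random.default_rng(3)
Q, budget = np.zeros(4), np.full(4, 0.5)
for t in range(3):
    options = [{"energy": rng.uniform(0, 1, 4), "loss": rng.uniform()}
               for _ in range(2)]
    chosen = drift_plus_penalty_step(Q, options)
    Q = update_queues(Q, chosen["energy"], budget)
print(Q)  # persistent backlog indicates the energy budget is binding
```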

Finally, most previous research on communication efficiency in FL has focused on the uplink (UL) transmission (from devices to server), whereas the downlink (DL) (from server to devices) has been relatively less investigated. In standard FL, the server broadcasts the global model to the devices in every iteration. Applying differential coding to the global model dissemination can save communication resources. However, devices may occasionally miss differential updates due to wireless link failures and thus fail to reconstruct the model. Consequently, they will continue local training based on an outdated model, or remain idle, until the next full-model broadcast is available. To address this issue, in Paper C we propose a mixed-timescale differential coding (MTDC) scheme that performs differential coding hierarchically at two different levels. With MTDC, between two full-model broadcasts, a device that misses an update can still reconstruct the latest model. We provide a convergence analysis, which motivates the design of an age-aware version of MTDC, as well as a device scheduling policy for improving communication efficiency. Simulation results show that the proposed MTDC schemes deliver superior learning performance compared to baseline methods with similar communication resource budgets and in the presence of device decoding failures.
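
The failure mode that MTDC targets can be illustrated with plain single-level differential coding: a device that misses one delta cannot apply later deltas and stays stale until the next full broadcast. The sketch below shows only this baseline behavior; the hierarchical two-timescale coding of MTDC itself is not reproduced here, and all names are illustrative.

```python
import numpy as np

def broadcast_stream(models, full_every=4):
    # Emit ('full', model) periodically, ('delta', difference) otherwise.
    for t, m in enumerate(models):
        if t % full_every == 0:
            yield ("full", m)
        else:
            yield ("delta", m - models[t - 1])

# Toy usage: the device misses round 2 and stays stale until the full
# broadcast at round 4 (the outage MTDC is designed to mitigate).
rng = np.random.default_rng(4)
models = [rng.normal(size=3) for _ in range(6)]
recon, missed_round = None, 2
for t, (kind, payload) in enumerate(broadcast_stream(models)):
    if t == missed_round:  # simulated downlink decoding failure
        recon = None
        continue
    if kind == "full":
        recon = payload
    elif recon is not None:  # a delta only helps on top of a valid model
        recon = recon + payload
    ok = recon is not None and np.allclose(recon, models[t])
    print(t, kind, "reconstructed" if ok else "stale/idle")
```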

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2025. p. 46
Series
Linköping Studies in Science and Technology. Dissertations, ISSN 0345-7524 ; 2498
National Category
Communication Systems
Identifiers
urn:nbn:se:liu:diva-219907 (URN)
10.3384/9789181183986 (DOI)
9789181183979 (ISBN)
9789181183986 (ISBN)
Public defence
2026-01-26, Nobel (BL32), B Building, Campus Valla, Linköping, 09:00 (English)
Available from: 2025-12-08 Created: 2025-12-08 Last updated: 2025-12-08. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text · Scopus · Correction
