Federated Learning With Position-Aware Neurons

Transcription

Xin-Chun Li (1), Yi-Chu Xu (1), Shaoming Song (2), Bingshuai Li (2), Yinchuan Li (2), Yunfeng Shao (2), De-Chuan Zhan (1)
(1) State Key Laboratory for Novel Software Technology, Nanjing University
(2) Huawei Noah's Ark Lab
{lixc, xuyc}@lamda.nju.edu.cn, zhandc@nju.edu.cn
{shaoming.song, libingshuai, liyinchuan, shaoyunfeng}@huawei.com

Abstract

Federated Learning (FL) fuses collaborative models from local nodes without centralizing users' data. The permutation invariance property of neural networks and the non-i.i.d. data across clients make the locally updated parameters imprecisely aligned, disabling coordinate-based parameter averaging. Traditional neurons do not explicitly consider position information. Hence, we propose Position-Aware Neurons (PANs) as an alternative, fusing position-related values (i.e., position encodings) into neuron outputs. PANs couple themselves to their positions and minimize the possibility of dislocation, even when updated on heterogeneous data. We turn PANs on/off to disable/enable the permutation invariance property of neural networks. When applied to FL, PANs are tightly coupled with positions, making parameters across clients pre-aligned and facilitating coordinate-based parameter averaging. PANs are algorithm-agnostic and can universally improve existing FL algorithms. Furthermore, "FL with PANs" is simple to implement and computationally friendly.

Figure 1. Left: Position-Aware Neurons (PANs). We fuse equal/varied position encodings into neurons' outputs, PANs are turned off/on, and the shuffled networks make the same/different predictions, i.e., the permutation invariance property is enabled/disabled. Right: applying PANs to FL. Neurons are coupled with their positions for pre-alignment.

1. Introduction

Federated Learning (FL) [13, 42] generates a global model by collaborating with isolated clients for privacy protection and efficient distributed training, generally following the parameter-server architecture [6, 21]. Clients update models on their devices using private data, and the server periodically averages these models over multiple communication rounds [27]. The whole process does not transmit users' data and meets basic privacy requirements. Represented by FedAvg [27], many FL algorithms aggregate local parameters via simple coordinate-based averaging [22-25]. These algorithms have two kinds of drawbacks. First, as traditional neurons are unaware of their positions, neural networks have the permutation invariance property, implying that hidden neurons can be dislocated during training without affecting local performance. Second, the samples across clients are non-independent and identically distributed (non-i.i.d.) [11], which can exacerbate the permutation of neural networks during local training, making local models misaligned and leading to weight divergence [47]. These factors degrade the performance of coordinate-based parameter averaging.

Recently, a series of works utilize various matching techniques to align neurons, such as Bayesian nonparametric learning [38, 44, 45] and optimal transport [2, 33]. First, these methods are complex to implement. Second, they solve the misalignment problem after finishing local updates and hence belong to post-processing strategies that need additional computation budgets. Fed2 [43] pioneers a novel aspect by designing feature-oriented model structures in a pre-aligned manner. However, it has to carefully customize the network architecture and only stays at the group level of pre-alignment. By contrast, we explore a more straightforward and general technique to pre-align neurons during local training.

Our work mainly focuses on solving the non-i.i.d. challenge in FL, more specifically, seeking solutions that limit the permutation invariance property of neural networks. We first summarize the above analysis: the permutation invariance property of neural networks leads to neuron misalignment across local models, and the more heterogeneous the data, the more serious the misalignment. Hence, our motivation is intuitive: could we design a switch to control the permutation invariance property of neural networks? We propose Position-Aware Neurons (PANs) as the solution, which couple neurons with their positions. Specifically, for each neuron (channel for ConvNets [10, 17, 32]), we add or multiply a position-related value (i.e., a position encoding) to its output. We introduce a hyper-parameter to turn PANs on/off and, correspondingly, to disable/enable the permutation invariance property of neural networks. PANs bind neurons to their positions, implicitly pre-aligning neurons across clients even with non-i.i.d. data. From another aspect, PANs keep some consistent ingredients in the forward and backward passes across local models, which reduces weight divergence. Overall, appropriate PANs facilitate coordinate-based parameter averaging in FL. Replacing traditional neurons with PANs is simple to implement, computationally friendly, and universal to various FL algorithms. Contributions can be briefed as: (1) proposing PANs to disable/enable the permutation invariance property of deep networks; (2) applying PANs to FL, which binds neurons to positions and pre-aligns parameters for better coordinate-wise parameter averaging.

2. Related Works

FL with Non-I.I.D. Data: Existing works solve the non-i.i.d. data problem in FL from various aspects. [47] points out the weight divergence phenomenon in FL and uses shared data to decrease the divergence. FedProx [23] adds a proximal term during local training as regularization. FedOpt [30] considers updating the global model via momentum or adaptive optimizers (e.g., Adam [15], Yogi [46]) instead of simple parameter averaging. Scaffold [14] introduces control variates to rectify the local update directions and mitigate the influence of client drift. MOON [22] utilizes model-contrastive learning to reduce the distance between local and global models. Some other works utilize similar techniques, including dynamic regularization [1], ensemble distillation [3, 26], etc. We take several representative FL algorithms and use PANs to improve them.

FL with Permutation Invariance Property: The permutation invariance of neural networks could lead to neuron misalignment. PFNM [45] matches local nodes' parameters via the Beta-Bernoulli process [35] and the Indian Buffet Process [9], formulating an optimal assignment problem and solving it via the Hungarian algorithm [18]. SPAHM [44] applies the same procedure to aggregate Gaussian topic models, hidden Markov models, and so on. FedMA [38] points out that PFNM does not apply to large-scale networks and proposes a layer-wise matching method. [33] utilizes optimal transport [2] to fuse models with different initializations. These methods are all post-processing ones that incur additional computation costs. Fed2 [43] is recently proposed to align features during local training by separating features into different groups. However, it needs to carefully design the architectures.
Differently, we take a more fine-grained alignment of neurons rather than network groups, and we will show that our method is more general.

Position Encoding: Position encoding is popular in sequence learning architectures, e.g., ConvS2S [8] and the transformer [36]. These architectures take position encodings to consider order information. Relative position encoding [31] is more applicable to sequences with various lengths. Some other studies are devoted to interpreting what position encodings learn [37, 39]. Another interesting work applies position encodings instead of zero-padding in GANs [41] as a spatial inductive bias. Differently, we resort to position encodings to bind neurons to their positions in FL. Furthermore, these works only consider position encodings at the input layer, while we couple them with neurons.

3. Position-Aware Neurons

In this section, we investigate the permutation invariance of neural networks and introduce PANs to control it.

3.1. Permutation Invariance Property

Assume an MLP has $L+1$ layers (containing the input and output layers), and each layer contains $J_l$ neurons, where $l \in \{0, 1, \cdots, L\}$ is the layer index. $J_0$ and $J_L$ are the input and output dimensions. We denote the parameters of each layer as the weight matrix $W_l \in \mathbb{R}^{J_l \times J_{l-1}}$ and the bias vector $b_l \in \mathbb{R}^{J_l}$, $l \in \{1, 2, \cdots, L\}$. The input layer does not have parameters. We use $h_l \in \mathbb{R}^{J_l}$ as the activations of the $l$-th layer, with $h_l = f_l(W_l h_{l-1} + b_l)$, where $f_l(\cdot)$ is the element-wise activation function, e.g., ReLU [28]. $f_L(x) = x$ denotes no activation function in the output layer. Sometimes, we use $y = v^T f(Wx + b)$ to represent a network with only one hidden layer and an output dimension of one (called MLP0), where $x \in \mathbb{R}^{J_0}$, $W \in \mathbb{R}^{J \times J_0}$, $b \in \mathbb{R}^{J}$, $v \in \mathbb{R}^{J}$. We use $\Pi \in \{0, 1\}^{J \times J}$ as a permutation matrix that satisfies $\sum_j \Pi_{\cdot,j} = 1$ and $\sum_j \Pi_{j,\cdot} = 1$. Easily, we have some properties: $\Pi^T \Pi = I$, $\Pi a + \Pi b = \Pi(a + b)$, $\Pi a \odot \Pi b = \Pi(a \odot b)$, where $I$ is the identity matrix and $\odot$ denotes the Hadamard product. If $f(\cdot)$ is an element-wise function, $f(\Pi x) = \Pi f(x)$.

For MLP0, we have $y = (\Pi v)^T f(\Pi W x + \Pi b) = v^T f(Wx + b)$, implying that if we permute the parameters properly, the output of a certain neural network does not change, i.e., the permutation invariance property. Extending it to an MLP, the layer-wise permutation process is

$$h_l = f_l(\Pi_l W_l \Pi_{l-1}^T h_{l-1} + \Pi_l b_l), \quad (1)$$

where $\Pi_0 = I$ and $\Pi_L = I$, meaning that the input and output layers are not shuffled. For ConvNets [17, 32], we take convolution kernels as the basic units. The convolution parameters can be denoted as $W_l \in \mathbb{R}^{C_l \times w_l \times h_l \times C_{l-1}}$, where the four dimensions denote the number of output/input channels ($C_l$, $C_{l-1}$) and the kernel size ($w_l$, $h_l$). The permutation can be applied similarly as $\Pi_l W_l \Pi_{l-1}^T$. For ResNet [10], we use $h_l = f_l(\Pi_l W_l \Pi_{l-1}^T h_{l-1}) + \Pi_l M_l \Pi_{l-1}^T h_{l-1}$ to permute all parameters in a basic block, including the shortcut (if the shortcut is not used, $M_l = I$).
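To make the MLP0 identity above concrete, here is a small NumPy check (our illustration, not code from the paper): a random permutation of the hidden units, applied consistently to $W$, $b$, and $v$, leaves the output unchanged. All variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
J0, J = 8, 16                       # input and hidden dimensions
W = rng.normal(size=(J, J0))
b = rng.normal(size=J)
v = rng.normal(size=J)
x = rng.normal(size=J0)

relu = lambda z: np.maximum(z, 0.0)
y = v @ relu(W @ x + b)             # y = v^T f(Wx + b)

Pi = np.eye(J)[rng.permutation(J)]  # a random permutation matrix Pi
y_sf = (Pi @ v) @ relu(Pi @ W @ x + Pi @ b)

print(np.allclose(y, y_sf))         # True: the shuffled MLP0 computes the same output
```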

3.2. Position-Aware Neurons

The essential reason for the permutation invariance of neural networks is that neurons have nothing to do with their positions. Hence, an intuitive improvement is fusing position-related values (position encodings) into neurons. We propose Position-Aware Neurons (PANs), adding or multiplying position encodings to neurons' outputs, i.e.,

$$\mathrm{PAN+}: \; h_l = f_l(W_l h_{l-1} + b_l + e_l), \quad (2)$$
$$\mathrm{PAN\times}: \; h_l = f_l((W_l h_{l-1} + b_l) \odot e_l), \quad (3)$$

where $e_l$ denotes position encodings that are only related to positions and are not learnable. We use "PAN+" and "PAN×" to represent additive and multiplicative PANs, respectively. We use sinusoidal functions to generate $e_l$, as commonly used in previous position encoding works [36], i.e.,

$$\mathrm{PAN+}: \; e_{l,j} = A \sin(2\pi T j / J) \in [-A, A], \quad (4)$$
$$\mathrm{PAN\times}: \; e_{l,j} = 1 + A \sin(2\pi T j / J) \in [1-A, 1+A], \quad (5)$$

where $T$ and $A$ respectively denote the period and amplitude of the position encodings, and $j \in \{0, 1, \cdots, J-1\}$ is the position index of a neuron. For ConvNets, we assign position encodings to each channel, and $j$ is the channel index. Notably, if we take $T = 0$ or $A = 0$, PANs degenerate into normal neurons. In practice, we only apply PANs to the hidden layers, while the input and output layers remain unchanged, i.e., $l \in \{1, 2, \cdots, L-1\}$ for $e_l$.
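As a concrete reading of Eqs. 2-5, here is a minimal PyTorch sketch of a fully connected PAN layer. This is our own illustration rather than the authors' released code; the class name PANLinear and its arguments are assumptions, and the encoding is registered as a fixed buffer so that it is shared, not learned.

```python
import math
import torch
import torch.nn as nn

class PANLinear(nn.Module):
    """Fully connected layer with Position-Aware Neurons (sketch of Eqs. 2-5).

    mode="add": PAN+  h = f(Wx + b + e)
    mode="mul": PANx  h = f((Wx + b) * e)
    A = 0 recovers an ordinary linear layer (PANs turned off).
    """
    def __init__(self, in_dim, out_dim, T=1.0, A=0.1, mode="mul"):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.mode = mode
        j = torch.arange(out_dim, dtype=torch.float32)
        enc = A * torch.sin(2 * math.pi * T * j / out_dim)   # Eq. 4
        if mode == "mul":
            enc = 1.0 + enc                                  # Eq. 5
        # fixed, non-learnable position encodings (shared across all clients in FL)
        self.register_buffer("enc", enc)

    def forward(self, x):
        z = self.linear(x)
        z = z * self.enc if self.mode == "mul" else z + self.enc
        return torch.relu(z)
```

For a ConvNet, the same encoding would instead be indexed by the output channel and broadcast over the spatial dimensions.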
With PANs, the permutation process in Eq. 1 can be reformulated as

$$\mathrm{PAN+}: \; h_{l,sf} = f_l(\Pi_l W_l \Pi_{l-1}^T h_{l-1,sf} + \Pi_l b_l + e_l), \quad (6)$$
$$\mathrm{PAN\times}: \; h_{l,sf} = f_l((\Pi_l W_l \Pi_{l-1}^T h_{l-1,sf} + \Pi_l b_l) \odot e_l), \quad (7)$$

where the subscript "sf" denotes "shuffled" (permuted). To measure the output change after shuffling, we define the shuffle error as

$$\mathrm{Err}(A, T, \{\Pi_l\}_{l=0}^{L}) = \|h_{L,sf} - h_L\| / J_L, \quad (8)$$

and this error on MLP0 without considering the bias (i.e., $y = v^T f(Wx + e)$) is

$$\mathrm{PAN+}: \; \mathrm{Err}(A, T, \Pi) = |y_{sf} - y| = \left|(\Pi v)^T f(\Pi W x + e) - v^T f(Wx + e)\right| = \left|(\Pi v)^T f(\Pi W x + e) - (\Pi v)^T f(\Pi W x + \Pi e)\right| \approx \left|(\Pi e - e)^T \frac{\partial y_{sf}}{\partial e}\right|, \quad (9)$$

where we take $y_{sf} = (\Pi v)^T f(\Pi W x + e)$ as a function of $e$ and use a first-order Taylor expansion as an approximation. Obviously, the shuffle error is closely related to the strength of the permutation, i.e., $\Pi - I$. For example, if $\Pi = I$, the network is not shuffled and the outputs are unchanged. Then, if we take equal values as position encodings, i.e., $e_j = e_i, \forall i, j$, the output also does not change because $\Pi e = e$; this can be obtained by taking $A = 0$ or $T = 0$. If we take a larger $T$ (e.g., 1) and a larger $A$ (e.g., 0.05), Err is generally non-zero because $\Pi e \neq e$. The error of multiplicative PANs is similar. We abstract PANs as a switch: if we take equal/varied position encodings, PANs are turned off/on, and the network correspondingly keeps/loses the permutation invariance property (i.e., the same/different outputs after permutation). As illustrated on the left of Fig. 1, the five neurons of a certain hidden layer are shuffled while the position encodings they add/multiply are not, so the outputs change with PANs turned on.

Furthermore, are there any essential differences between additive and multiplicative PANs, and how much influence do they have on the shuffle error? In Eq. 9, the shuffle error is partially determined by $\partial y_{sf} / \partial e$, and we extend this gradient to an MLP with multiple layers. We assume all layers have the same number of neurons (i.e., $J_l = J, \forall l$) and take the same position encodings (i.e., $e_l = e \in \mathbb{R}^J, \forall l$). We denote $s_{l,sf} = \Pi_l W_l \Pi_{l-1}^T h_{l-1,sf} + \Pi_l b_l$ and obtain the recursive gradient expressions:

$$\mathrm{PAN+}: \; \frac{\partial h_{l,sf}}{\partial e} = D(f_l') \left( \frac{\partial s_{l,sf}}{\partial h_{l-1,sf}} \frac{\partial h_{l-1,sf}}{\partial e} + I \right), \quad (10)$$
$$\mathrm{PAN\times}: \; \frac{\partial h_{l,sf}}{\partial e} = D(f_l') \left( [e]_J \odot \frac{\partial s_{l,sf}}{\partial h_{l-1,sf}} \frac{\partial h_{l-1,sf}}{\partial e} + D(s_{l,sf}) \right), \quad (11)$$

where $D(\cdot)$ transforms a vector into a diagonal matrix and $[\cdot]_J$ repeats a vector $J$ times to obtain a matrix. $f_l'$ denotes the gradient of the activation function, whose elements are 0 or 1 for ReLU. If we expand Eq. 10 and Eq. 11 correspondingly, we find that the gradient $\partial h_{L,sf} / \partial e$ of additive PANs does not explicitly rely on $e$. However, for the multiplicative one, $\partial h_{l,sf} / \partial e$ depends on both $\partial h_{l-1,sf} / \partial e$ and $[e]_J$, which leads to a polynomial term $A^{L-1}$ (resulting from $[e]_J \odot \cdots \odot [e]_J$, informally). Hence, we conclude: taking PANs as a switch can control the permutation invariance property of neural networks, and the designed multiplicative PANs make this switch more sensitive.

4. FL with PANs

In this section, we briefly introduce FedAvg [27] and analyze the effects of PANs when applied to FL.

Figure 2. Left: how many neurons are not shuffled with various Psf. Right: a permutation matrix demo with Psf = 0.1.

4.1. FedAvg

Suppose we have a server and $K$ clients with various data distributions. FedAvg first initializes a global model $\theta_0$ on the server. Then, a small fraction (i.e., $R \in [0, 1]$) of clients $S_t$ download the global model and update it on their local data for $E$ epochs, and then upload the updated models $\theta_0^{(k)}$ to the server. The server then takes a coordinate-based parameter average, i.e., $\theta_1 = \frac{1}{|S_t|} \sum_{k \in S_t} \theta_0^{(k)}$. Next, $\theta_1$ is sent down for a new communication round, and this is repeated for $H$ communication rounds. Because the parameters could be misaligned during local training, some works [38, 44, 45] are devoted to finding the correspondences between clients' uploaded neurons for better aggregation. For example, the parameters $W_l^{(1)}$ and $W_l^{(2)}$ may be misaligned, and we should search for proper matrices to match them, i.e., $\frac{1}{2}(W_l^{(1)} + M_l W_l^{(2)} M_{l-1}^T)$, rather than $\frac{1}{2}(W_l^{(1)} + W_l^{(2)})$ [33]. However, searching for appropriate $M_{\{l, l-1\}}$ is challenging. Generally, these works require additional data to search for a proper alignment. In addition, the matching process typically has to solve complex optimization problems, such as optimal transport or optimal assignment, leading to additional computational overhead. An intuitive question is: could we pre-align the neurons during local training instead of post-matching?

4.2. Applying PANs to FL

Replacing traditional neurons with PANs in FL is straightforward to implement. Why does such a subtle change help? We heuristically expect PANs in FL to bring the following effect: PANs limit the dislocation of neurons, since disturbing them brings significant changes to the outputs of the neural network and leads to higher training errors and fluctuations. Theoretically, the forward pass on the $k$-th client with PANs is as follows:

$$\mathrm{PAN+}: \; h_l^{(k)} = f_l(W_l^{(k)} h_{l-1}^{(k)} + b_l^{(k)} + e_l), \quad (12)$$
$$\mathrm{PAN\times}: \; h_l^{(k)} = f_l((W_l^{(k)} h_{l-1}^{(k)} + b_l^{(k)}) \odot e_l). \quad (13)$$

Notably, the position encodings are commonly utilized across clients, i.e., the forward passes across local clients share some consistent information. Then, the parameter gradients of Eq. 12 and Eq. 13 can be calculated as

$$\mathrm{PAN+}: \; \partial h_l^{(k)} / \partial b_l^{(k)} = D(f_l^{(k)\prime}), \quad (14)$$
$$\mathrm{PAN\times}: \; \partial h_l^{(k)} / \partial b_l^{(k)} = D(f_l^{(k)\prime}) D(e_l), \quad (15)$$

where we only give the gradient of the bias for simplicity. The gradients of multiplicative PANs directly contain the same position information across clients (e.g., $e_l$) in spite of various data distributions (e.g., $h_{l-1}^{(k)}$). For the additive ones, the impact of $e_l$ is implicit because $f_l^{(k)\prime}$ is related to $e_l$; nevertheless, the effect is not as significant as for the multiplicative ones. Overall, $e_l$ could regularize and rectify local gradient directions, keeping some ingredients consistent during backward propagation. As an extreme case, if $A$ in $e_l$ is very large, the gradients in Eq. 14 and Eq. 15 tend to be the same, mitigating weight divergence completely. However, setting $e_l$ too large makes the neural network difficult to train and completely covers the data information, so the strength of $e_l$ (i.e., $A$) is a tradeoff.
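Since PANs only change the model definition, the FL side stays a standard FedAvg-style loop with coordinate-wise averaging. The sketch below is our illustration (not the paper's training code), assuming the global model is built from layers like the PANLinear sketch above and that each client is represented by a DataLoader over its private data.

```python
import copy
import random
import torch
import torch.nn.functional as F

def local_update(model, loader, epochs=5, lr=0.01):
    """Plain local SGD on one client's private data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model.state_dict()

def fedavg_with_pans(global_model, client_loaders, rounds=100, ratio=0.1, epochs=5):
    """FedAvg with coordinate-based averaging (Sect. 4.1). The position encodings
    are fixed buffers inside the PAN layers, so every client shares the same e_l;
    averaging leaves them untouched and only fuses weights and biases."""
    for _ in range(rounds):
        num = max(1, int(ratio * len(client_loaders)))
        selected = random.sample(client_loaders, num)
        states = [local_update(copy.deepcopy(global_model), ld, epochs) for ld in selected]
        avg = {k: torch.stack([s[k].float() for s in states]).mean(0) for k in states[0]}
        global_model.load_state_dict(avg)
    return global_model
```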
5. Experiments

We study how much influence the proposed PANs have on both centralized training and decentralized training (i.e., FL). The datasets used are Mnist [20], FeMnist [4], SVHN [29], GTSRB [34], Cifar10/100 [16], and Cinic10 [5]. FeMnist is recommended by LEAF [4] and FedScale [19]. By default, we use an MLP for Mnist/FeMnist, VGG [32] for SVHN/GTSRB/Cifar10, and ResNet20 [10] for Cifar100/Cinic10 unless declared otherwise. We sometimes take the VGG9 used in previous FL works [26, 38, 43]. For centralized training, we use the provided training and test sets correspondingly. For FL, we split the training set according to Dirichlet distributions, where Dir(α) controls the non-i.i.d. level; smaller α leads to more non-i.i.d. cases. For each FL scene, we report several key hyper-parameters: the number of clients K, the client participation ratio R, the number of local training epochs E, the Dirichlet alpha α, and the number of communication rounds H. For PANs, we report T and A. With A = 0.0, we turn off PANs, i.e., we use traditional neurons (the baselines); with A > 0.0, we turn on PANs. We leave PANs turned on by default when the on/off state or the value of A is not mentioned. Details of datasets, networks, and training are presented in Supp.
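For reference, a common way to build such Dirichlet splits (our sketch, not necessarily the authors' exact script) is to draw, for every class, a Dir(α) proportion vector over the clients and distribute that class's samples accordingly; smaller α concentrates each class on fewer clients.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices into `num_clients` shards with Dir(alpha) class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # cumulative proportions define the split points for this class
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            shards[k].extend(part.tolist())
    return shards

# e.g., shards = dirichlet_partition(train_labels, num_clients=100, alpha=0.1)
```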

Figure 3. Left: shuffle error (Eq. 8) with various T and A (PAN+). Right: the difference between PAN+ and PAN× (T = 1). (VGG13 is used; more networks are in Supp.)

5.1. Centralized Training

Shuffle Test: We first propose a procedure to measure the degree of permutation invariance of a certain neural network, that is, how large the shuffle error in Eq. 8 is after shuffling the neurons. We name this procedure the shuffle test. Given a neural network and a batch of data, we first obtain the outputs. Then, we shuffle the neurons of the hidden layers. The shuffle process is shown in Supp, where Psf controls the disorder level of the constructed permutation matrices. We then obtain the outputs after shuffling and calculate the shuffle error. We vary Psf in [0, 1] and plot the ratio of ones on the permutation matrices' diagonals (i.e., how many neurons are not shuffled). We denote this ratio as Rkept and plot it in Fig. 2 (average of 10 experiments), where we also show a generated permutation matrix with Psf = 0.1.
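A minimal version of this shuffle test is sketched below (ours; the paper's exact Psf-controlled construction is in its supplementary, so the swap-based permutation here is only an approximation of that procedure). It permutes one hidden nn.Linear layer together with the following layer's input weights and reports the output change of Eq. 8.

```python
import random
import torch

def psf_permutation(n, p_sf, seed=0):
    """Permutation where each position is swapped with a random partner with prob. p_sf
    (an illustrative stand-in for the construction described in Supp)."""
    rnd = random.Random(seed)
    perm = list(range(n))
    for i in range(n):
        if rnd.random() < p_sf:
            j = rnd.randrange(n)
            perm[i], perm[j] = perm[j], perm[i]
    return torch.tensor(perm)

@torch.no_grad()
def shuffle_error(model, x, hidden, next_layer, p_sf=0.1):
    """Shuffle one hidden layer in place and return Err = ||h_L,sf - h_L|| / J_L (Eq. 8)."""
    out = model(x)
    perm = psf_permutation(hidden.out_features, p_sf)
    hidden.weight.copy_(hidden.weight[perm])              # Pi_l W_l
    hidden.bias.copy_(hidden.bias[perm])                  # Pi_l b_l
    next_layer.weight.copy_(next_layer.weight[:, perm])   # W_{l+1} Pi_l^T
    out_sf = model(x)
    return (out_sf - out).norm().item() / out.shape[-1]

# For a plain MLP this returns roughly 0; with PANs on (A > 0), pass the inner
# nn.Linear of the PAN layer as `hidden` and the error grows with A.
```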
Shuffle Error with Random Data: With different hyper-parameters T and A in Eq. 4/Eq. 5, we use random data generated from Gaussian distributions (i.e., $x_{i,\cdot} \sim \mathcal{N}(0, 1)$) to calculate the shuffle error. The results based on VGG13 are shown in Fig. 3. The error is more related to A and less sensitive to T. This is intuitive because T controls local volatility, while neuron permutation can happen globally, e.g., the first neuron could swap positions with the last neuron. A larger A leads to a larger shuffle error, i.e., the network loses the permutation invariance property more severely. In addition, the shuffle error of additive PANs increases linearly, while that of multiplicative PANs increases much faster. This verifies the theoretical analysis in Sect. 3.2. However, in practice, a larger A may cause training failure, and we only set A ∈ [0.0, 0.25] for additive PANs and A ∈ [0.0, 0.75] for multiplicative PANs (the bold part on the right side of Fig. 3).

Figure 4. The first: test accuracy of models trained with different PANs. The other three: test accuracy change after manual permutation with various Psf. (The last three panels: Mnist MLP, SVHN VGG13, Cifar10 ResNet20.)

Influence on Inference: We study the influence of PANs on test accuracies. We use MLP on Mnist, VGG13 on SVHN, and ResNet20 on Cifar10. We first train models with various PANs until convergence; the model performances are shown in the first plot of Fig. 4. The horizontal dotted lines show the accuracies of normal networks, and the solid segments show the results of networks with various PANs. We find that introducing PANs does not improve performance but brings a slight degradation. That is, PANs could make the network somewhat harder to train. More studies of how PANs influence the network predictions can be found in Supp. Then, we investigate the shuffle error reflected by the change of test accuracies. Specifically, we shuffle the trained network and make predictions on the test set, varying several groups of T and A for PANs. The results are shown in the last three plots of Fig. 4. With larger Psf, i.e., more neurons shuffled, the test accuracy of the network with A = 0.0 does not change (the permutation invariance property). However, a larger A leads to more significant performance degradation (A = 0.25 vs. A = 0.01 for PAN+; A = 0.75 vs. A = 0.05 for PAN×). PAN× makes the network more sensitive to shuffling than PAN+ (the "×" curves degrade significantly). With different T ∈ {1, 8}, the performance degradation is nearly the same, again showing that PANs are robust to T. These results verify the conclusions in Sect. 3.2. Overall, PANs work as a tradeoff between model performance and control of permutation invariance.

5.2. Decentralized Training

We then study the effects of introducing PANs into FL. We first present some empirical studies to verify the pre-alignment effects of PANs, and then show performances.

How many neurons are misaligned in FL? Although some previous works [38, 43, 45] declare that neurons could be dislocated when faced with non-i.i.d. data, they neither show this in evidence nor quantify the degree of misalignment. We present a heuristic method: we manually shuffle the neurons during local training with i.i.d. data and study how much misalignment causes the performance to drop to the level of training with non-i.i.d. data. Specifically, during each client's training step (each batch as a step), we shuffle the neurons with probability $\frac{N_{sf}}{E N_k / B}$, where $B$, $E$, and $N_k$ are respectively the batch size, the number of local epochs, and the number of local data samples. In each shuffle process, we keep Psf = 0.1. Nsf determines how many times the network could be shuffled during local training; a larger Nsf means more neurons are shuffled upon finishing training, e.g., Nsf = 1.0 keeps approximately 84% of neurons not shuffled, as shown in Fig. 5. The calculation of Rkept in Fig. 5 is presented in Supp. Then, we show the test accuracies of FedAvg [27] under various levels of non-i.i.d. data, i.e., α ∈ {10.0, 1.0, 0.1}. The results correspond to the three horizontal lines in the bottom three plots of Fig. 5, while the red scatters show the performances of shuffling neurons with various Nsf. Obviously, even with i.i.d. data, the larger the Nsf, the worse the performance. This implies that neuron misalignment can indeed lead to performance degradation. Compared with the non-i.i.d. performances, taking Cifar10 as an example, setting Nsf = 0.2 makes the i.i.d. (α = 10.0) performance degrade to the same level as non-i.i.d. (α = 0.1); that is, approximately 3.8% of neurons are misaligned on each client. This may provide some enlightenment for quantitatively measuring how many neurons are misaligned in FL with non-i.i.d. data.

Figure 5. Top: how many neurons are not shuffled with various Nsf. Bottom: test accuracies of FL with various α (dotted lines) and accuracies after manual shuffling on i.i.d. data (α = 10.0) (red scatters). (Panels: Mnist MLP, SVHN VGG11, Cifar10 ResNet20.)

Do PANs indeed reduce the possibility of neuron misalignment? We propose several strategies from the aspects of parameters, activations, and preference vectors to compare the neuron correspondences in FL with PANs off/on. For PANs turned on, we use multiplicative PANs with T = 1.0 and A = 0.1 by default.

Figure 6. Weight divergence with PANs off/on. (EMnist; more datasets' results are in Supp.)

I. Weight Divergence: Weight divergence [47] measures the variance of local parameters. Specifically, we calculate $\frac{1}{|S_t|} \sum_{k \in S_t} \|W_l^{(k)} - \overline{W}_l\|_2$ for each layer $l$, where $\overline{W}_l = \frac{1}{|S_t|} \sum_{k \in S_t} W_l^{(k)}$ denotes the averaged parameters. The weight divergences of MLP on Mnist with α ∈ {1.0, 0.1} are shown in Fig. 6, where PANs reduce the divergences a lot (the red bars). This corresponds to the explanation in Sect. 4.2 that clients' parameters are partially updated towards the same direction.
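For completeness, here is a small sketch (ours, not the paper's measurement script) of this layer-wise weight divergence computed over the state dicts uploaded in one round:

```python
import torch

def weight_divergence(client_states):
    """Per-parameter weight divergence: mean_k ||W_l^(k) - mean_k(W_l^(k))||_2.

    `client_states` is a list of state_dicts uploaded by the selected clients
    in one communication round (an assumption about how they are collected)."""
    divergence = {}
    num = len(client_states)
    for name in client_states[0]:
        stacked = torch.stack([s[name].float() for s in client_states])
        diff = (stacked - stacked.mean(dim=0)).reshape(num, -1)
        divergence[name] = diff.norm(dim=1).mean().item()
    return divergence
```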
Figure 7. Optimal assignment matrix with PANs off/on, left vs. right. (α = 1.0, E = 20, VGG9 Conv5 on Cifar10; more results are in Supp.)

II. Matching via Optimal Assignment: We feed 500 test samples into the network and obtain the activations of each neuron as its representation. The neurons' representations of the global and local models are denoted as $h_l \in \mathbb{R}^{J_l \times m}$ and $h_l^{(k)} \in \mathbb{R}^{J_l \times m}$, respectively, where m = 500. Then we search for the optimal assignment between them.
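One way to carry out such a search (our illustration of the idea, not the paper's exact procedure) is to build a similarity matrix between global and local neuron activations and solve the assignment with the Hungarian algorithm [18], e.g., via scipy's linear_sum_assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_neurons(h_global, h_local):
    """h_global, h_local: (J_l, m) activation matrices of one layer over m probe samples.

    Returns the local index matched to each global neuron and the fraction of
    neurons matched to themselves (1.0 means the layers are already aligned)."""
    g = h_global / (np.linalg.norm(h_global, axis=1, keepdims=True) + 1e-12)
    l = h_local / (np.linalg.norm(h_local, axis=1, keepdims=True) + 1e-12)
    sim = g @ l.T                           # cosine similarity, (J_l, J_l)
    row, col = linear_sum_assignment(-sim)  # maximize total similarity
    return col, float(np.mean(col == row))
```

With PANs turned on, one would expect the resulting assignment to sit much closer to the identity, i.e., a nearly diagonal matrix (cf. Fig. 7).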

Figure 8. Preference vectors with PANs off/on, left vs. right. (α = 1.0, VGG9 Conv6 on Cifar10; more results are shown in Supp.)

Figure 9. Comparison results on non-i.i.d. data (α = 0.1). Rows show datasets and columns show FL algorithms. PANs could universally improve these algorithms. (More datasets are shown in Supp.)

Figure 10. Comparisons under various levels of non-i.i.d. data on Cinic10. Smaller α implies more non-i.i.d. data. (More datasets are shown in Supp.)

Do PANs bring performance improvement in FL? We then compare the performances of FL with PANs off/on.

I. Universal Application of PANs: We first apply PANs to some popular FL algorithms as introduced in Sect. 2, including FedAvg [27], FedProx [23], FedOpt [30], Scaffold [14], and MOON [22]. These methods solve the non-i.i.d. problem from different aspects. Training details of these algorithms are presented in Supp.
