Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transcription

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai (1,2), Zhilin Yang (1,2), Yiming Yang (1), Jaime Carbonell (1), Quoc V. Le (2), Ruslan Salakhutdinov (1)
(1) Carnegie Mellon University, (2) Google
(Equal contribution. Order determined by swapping the one in Yang et al.)

Abstract

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture, Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800 times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwik8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.

1 Introduction

Language modeling is among the important problems that require modeling long-term dependency, with successful applications such as unsupervised pretraining (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). However, it has been a challenge to equip neural networks with the capability to model long-term dependency in sequential data. Recurrent neural networks (RNNs), in particular Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), have been a standard solution to language modeling and obtained strong results on multiple benchmarks. Despite the wide adoption, RNNs are difficult to optimize due to gradient vanishing and explosion (Hochreiter et al., 2001), and the introduction of gating in LSTMs and the gradient clipping technique (Graves, 2013) might not be sufficient to fully address this issue. Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement.

On the other hand, the direct connections between long-distance word pairs baked in attention mechanisms might ease optimization and enable the learning of long-term dependency (Bahdanau et al., 2014; Vaswani et al., 2017). Recently, Al-Rfou et al. (2018) designed a set of auxiliary losses to train deep Transformer networks for character-level language modeling, which outperform LSTMs by a large margin. Despite the success, the LM training in Al-Rfou et al. (2018) is performed on separated fixed-length segments of a few hundred characters, without any information flow across segments. As a consequence of the fixed context length, the model cannot capture any longer-term dependency beyond the predefined context length. In addition, the fixed-length segments are created by selecting a consecutive chunk of symbols without respecting the sentence or any other semantic boundary. Hence, the model lacks the necessary contextual information needed to well predict the first few symbols, leading to inefficient optimization and inferior performance.
We refer to this problem as context fragmentation.

To address the aforementioned limitations of fixed-length contexts, we propose a new architecture called Transformer-XL (meaning extra long).

We introduce the notion of recurrence into our deep self-attention network. In particular, instead of computing the hidden states from scratch for each new segment, we reuse the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, which builds up a recurrent connection between the segments. As a result, modeling very long-term dependency becomes possible because information can be propagated through the recurrent connections. Meanwhile, passing information from the previous segment can also resolve the problem of context fragmentation. More importantly, we show the necessity of using relative positional encodings rather than absolute ones, in order to enable state reuse without causing temporal confusion. Hence, as an additional technical contribution, we introduce a simple but more effective relative positional encoding formulation that generalizes to attention lengths longer than the one observed during training.

Transformer-XL obtained strong results on five datasets, varying from word-level to character-level language modeling. Transformer-XL is also able to generate relatively coherent long text articles with thousands of tokens (see Appendix E), trained on only 100M tokens.

Our main technical contributions include introducing the notion of recurrence in a purely self-attentive model and deriving a novel positional encoding scheme. These two techniques form a complete set of solutions, as any one of them alone does not address the issue of fixed-length contexts. Transformer-XL is the first self-attention model that achieves substantially better results than RNNs on both character-level and word-level language modeling.

2 Related Work

In the last few years, the field of language modeling has witnessed many significant advances, including but not limited to devising novel architectures to better encode the context (Bengio et al., 2003; Mikolov et al., 2010; Merity et al., 2016; Al-Rfou et al., 2018), improving regularization and optimization algorithms (Gal and Ghahramani, 2016), speeding up the Softmax computation (Grave et al., 2016a), and enriching the output distribution family (Yang et al., 2017).

To capture the long-range context in language modeling, a line of work directly feeds a representation of the wider context into the network as an additional input. Existing works range from ones where context representations are manually defined (Mikolov and Zweig, 2012; Ji et al., 2015; Wang and Cho, 2015) to others that rely on document-level topics learned from data (Dieng et al., 2016; Wang et al., 2017).

More broadly, in generic sequence modeling, how to capture long-term dependency has been a long-standing research problem. From this perspective, since the ubiquitous adoption of LSTM, many efforts have been spent on relieving the vanishing gradient problem, including better initialization (Le et al., 2015), additional loss signal (Trinh et al., 2018), augmented memory structure (Ke et al., 2018) and others that modify the internal architecture of RNNs to ease the optimization (Wu et al., 2016; Li et al., 2018). Different from them, our work is based on the Transformer architecture and shows that language modeling as a real-world task benefits from the ability to learn longer-term dependency.

3 Model

Given a corpus of tokens x = (x_1, ..., x_T), the task of language modeling is to estimate the joint probability P(x), which is often auto-regressively factorized as P(x) = \prod_t P(x_t \mid x_{<t}). With the factorization, the problem reduces to estimating each conditional factor. In this work, we stick to the standard neural approach to modeling the conditional probability. Specifically, a trainable neural network is used to encode the context x_{<t} into a fixed size hidden state, which is multiplied with the word embeddings to obtain the logits. The logits are then fed into the Softmax function, yielding a categorical probability distribution over the next token.
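As a minimal PyTorch-style sketch of this standard setup (the names and shapes below are ours for illustration, not taken from the released code), the context is first encoded into a hidden state, which is multiplied with the (tied) word-embedding matrix to obtain logits, and the Softmax turns the logits into a distribution over the next token:

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 10000, 512
embedding = torch.nn.Embedding(vocab_size, d_model)   # word embeddings, reused as the output layer

def next_token_distribution(hidden_t):
    """hidden_t: [d_model] hidden state encoding the context x_{<t}."""
    logits = hidden_t @ embedding.weight.t()           # [vocab_size] logits via tied embeddings
    return F.softmax(logits, dim=-1)                   # categorical distribution over the next token
```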
3.1 Vanilla Transformer Language Models

In order to apply Transformer or self-attention to language modeling, the central problem is how to train a Transformer to effectively encode an arbitrarily long context into a fixed size representation. Given infinite memory and computation, a simple solution would be to process the entire context sequence using an unconditional Transformer decoder, similar to a feed-forward neural network. However, this is usually infeasible with the limited resources in practice.

[Figure 1: Illustration of the vanilla model with a segment length of 4. (a) Training phase. (b) Evaluation phase.]

One feasible but crude approximation is to split the entire corpus into shorter segments of manageable sizes, and only train the model within each segment, ignoring all contextual information from previous segments. This is the idea adopted by Al-Rfou et al. (2018). We call it the vanilla model and visualize it in Fig. 1a. Under this training paradigm, information never flows across segments in either the forward or backward pass. There are two critical limitations of using a fixed-length context. First, the largest possible dependency length is upper bounded by the segment length, which is a few hundred on character-level language modeling (Al-Rfou et al., 2018). Therefore, although the self-attention mechanism is less affected by the vanishing gradient problem compared to RNNs, the vanilla model is not able to fully exploit this optimization advantage. Second, though it is possible to use padding to respect the sentence or other semantic boundaries, in practice it has been standard practice to simply chunk long text into fixed-length segments due to improved efficiency (Peters et al., 2018; Devlin et al., 2018; Al-Rfou et al., 2018). However, simply chunking a sequence into fixed-length segments will lead to the context fragmentation problem as discussed in Section 1.

During evaluation, at each step, the vanilla model also consumes a segment of the same length as in training, but only makes one prediction at the last position. Then, at the next step, the segment is shifted to the right by only one position, and the new segment has to be processed all from scratch. As shown in Fig. 1b, this procedure ensures that each prediction utilizes the longest possible context exposed during training, and also relieves the context fragmentation issue encountered in training. However, this evaluation procedure is extremely expensive. We will show that our proposed architecture is able to substantially improve the evaluation speed.

3.2 Segment-Level Recurrence with State Reuse

To address the limitations of using a fixed-length context, we propose to introduce a recurrence mechanism to the Transformer architecture. During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment, as shown in Fig. 2a. Although the gradient still remains within a segment, this additional input allows the network to exploit information in the history, leading to an ability of modeling longer-term dependency and avoiding context fragmentation. Formally, let the two consecutive segments of length L be s_\tau = [x_{\tau,1}, ..., x_{\tau,L}] and s_{\tau+1} = [x_{\tau+1,1}, ..., x_{\tau+1,L}], respectively. Denote the n-th layer hidden state sequence produced for the \tau-th segment s_\tau by h_\tau^n \in \mathbb{R}^{L \times d}, where d is the hidden dimension. Then, the n-th layer hidden state for segment s_{\tau+1} is produced (schematically) as follows:

\tilde{h}_{\tau+1}^{n-1} = [\mathrm{SG}(h_\tau^{n-1}) \circ h_{\tau+1}^{n-1}],
q_{\tau+1}^n, k_{\tau+1}^n, v_{\tau+1}^n = h_{\tau+1}^{n-1} W_q^\top, \tilde{h}_{\tau+1}^{n-1} W_k^\top, \tilde{h}_{\tau+1}^{n-1} W_v^\top,
h_{\tau+1}^n = \text{Transformer-Layer}(q_{\tau+1}^n, k_{\tau+1}^n, v_{\tau+1}^n),

where the function SG(·) stands for stop-gradient, the notation [h_u \circ h_v] indicates the concatenation of two hidden sequences along the length dimension, and W_· denotes model parameters.
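In PyTorch-style pseudocode (a rough sketch with illustrative names, not the released implementation), the stop-gradient SG(·) corresponds to .detach() and [· ◦ ·] to concatenation along the length dimension:

```python
import torch

L, d = 4, 16                                    # toy segment length and hidden size
W_q = torch.nn.Linear(d, d, bias=False)
W_k = torch.nn.Linear(d, d, bias=False)
W_v = torch.nn.Linear(d, d, bias=False)

h_prev = torch.randn(L, d)                      # h_tau^{n-1}: cached from the previous segment
h_curr = torch.randn(L, d, requires_grad=True)  # h_{tau+1}^{n-1}: current segment

h_ext = torch.cat([h_prev.detach(), h_curr], dim=0)  # [SG(h_tau^{n-1}) ∘ h_{tau+1}^{n-1}], length 2L
q = W_q(h_curr)                                 # queries come from the current segment only
k, v = W_k(h_ext), W_v(h_ext)                   # keys/values are conditioned on the extended context
# q, k, v would then be fed into a standard masked self-attention layer.
```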
Compared to the standard Transformer, the critical difference lies in that the key k_{\tau+1}^n and value v_{\tau+1}^n are conditioned on the extended context \tilde{h}_{\tau+1}^{n-1} and hence h_\tau^{n-1} cached from the previous segment. We emphasize this particular design by the green paths in Fig. 2a.

With this recurrence mechanism applied to every two consecutive segments of a corpus, it essentially creates a segment-level recurrence in the hidden states. As a result, the effective context being utilized can go way beyond just two segments. However, notice that the recurrent dependency between h_{\tau+1}^n and h_\tau^{n-1} shifts one layer downwards per segment, which differs from the same-layer recurrence in conventional RNN-LMs.

[Figure 2: Illustration of the Transformer-XL model with a segment length of 4. (a) Training phase. (b) Evaluation phase.]

Consequently, the largest possible dependency length grows linearly w.r.t. the number of layers as well as the segment length, i.e., O(N × L), as visualized by the shaded area in Fig. 2b. This is analogous to truncated BPTT (Mikolov et al., 2010), a technique developed for training RNN-LMs. However, different from truncated BPTT, our method caches a sequence of hidden states instead of the last one, and should be applied together with the relative positional encoding technique described in Section 3.3.

Besides achieving extra long context and resolving fragmentation, another benefit that comes with the recurrence scheme is significantly faster evaluation. Specifically, during evaluation, the representations from the previous segments can be reused instead of being computed from scratch as in the case of the vanilla model. In our experiments on enwik8, Transformer-XL is up to 1,800 times faster than the vanilla model during evaluation (see Section 4).

Finally, notice that the recurrence scheme does not need to be restricted to only the previous segment. In theory, we can cache as many previous segments as the GPU memory allows, and reuse all of them as the extra context when processing the current segment. Thus, we can cache old hidden states of a predefined length M spanning (possibly) multiple segments, and refer to them as the memory m_\tau^n \in \mathbb{R}^{M \times d}, due to a clear connection to the memory augmented neural networks (Graves et al., 2014; Weston et al., 2014). In our experiments, we set M equal to the segment length during training, and increase it by multiple times during evaluation.
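A rough sketch of how such a length-M memory could be maintained per layer (our own illustration with hypothetical names, not the released code): after each segment, the new hidden states are appended to the cache and only the most recent M positions are kept, detached from the computation graph.

```python
import torch

def update_memory(prev_mem, hidden, mem_len):
    """Append the current segment's hidden states to the cached memory and keep
    only the most recent `mem_len` positions, detached from the autograd graph."""
    new_mem = hidden if prev_mem is None else torch.cat([prev_mem, hidden], dim=0)
    return new_mem[-mem_len:].detach()

n_layers, L, d = 3, 4, 16
mem_len = L                        # M = segment length during training; larger at evaluation
mems = [None] * n_layers           # one memory tensor per layer
hiddens = [torch.randn(L, d) for _ in range(n_layers)]   # stand-ins for one segment's layer outputs
mems = [update_memory(m, h, mem_len) for m, h in zip(mems, hiddens)]
```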
3.3 Relative Positional Encodings

While we found the idea presented in the previous subsection very appealing, there is a crucial technical challenge we haven't solved in order to reuse the hidden states. That is, how can we keep the positional information coherent when we reuse the states? Recall that, in the standard Transformer, the information of sequence order is provided by a set of positional encodings, denoted as U \in \mathbb{R}^{L_{max} \times d}, where the i-th row U_i corresponds to the i-th absolute position within a segment and L_{max} prescribes the maximum possible length to be modeled. Then, the actual input to the Transformer is the element-wise addition of the word embeddings and the positional encodings. If we simply adapt this positional encoding to our recurrence mechanism, the hidden state sequence would be computed schematically by

h_{\tau+1} = f(h_\tau, E_{s_{\tau+1}} + U_{1:L}),
h_\tau = f(h_{\tau-1}, E_{s_\tau} + U_{1:L}),

where E_{s_\tau} \in \mathbb{R}^{L \times d} is the word embedding sequence of s_\tau, and f represents a transformation function. Notice that both E_{s_\tau} and E_{s_{\tau+1}} are associated with the same positional encoding U_{1:L}. As a result, the model has no information to distinguish the positional difference between x_{\tau,j} and x_{\tau+1,j} for any j = 1, ..., L, resulting in a sheer performance loss.

In order to avoid this failure mode, the fundamental idea is to only encode the relative positional information in the hidden states. Conceptually, the positional encoding gives the model a temporal clue or "bias" about how information should be gathered, i.e., where to attend. For the same purpose, instead of incorporating bias statically into the initial embedding, one can inject the same information into the attention score of each layer. More importantly, it is more intuitive and generalizable to define the temporal bias in a relative manner. For instance, when a query vector q_{\tau,i} attends on the key vectors k_{\tau,\le i}, it does not need to know the absolute position of each key vector to identify the temporal order of the segment. Instead, it suffices to know the relative distance between each key vector k_{\tau,j} and itself q_{\tau,i}, i.e., i - j.
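Concretely, with a memory of length M in front of the current segment, the relative distance i − j between every query position i and key position j can be laid out as a matrix (a small illustrative sketch; the index convention is ours):

```python
import torch

qlen, mlen = 4, 4                          # current segment length and cached memory length
klen = qlen + mlen                         # keys span the memory plus the current segment
q_pos = torch.arange(mlen, mlen + qlen)    # absolute positions of the queries
k_pos = torch.arange(klen)                 # absolute positions of the keys
distance = q_pos[:, None] - k_pos[None, :] # distance[i, j] = i - j; negative entries correspond
                                           # to future keys, which the causal mask removes anyway
print(distance)
```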

Practically, one can create a set of relative positional encodings R \in \mathbb{R}^{L_{max} \times d}, where the i-th row R_i indicates a relative distance of i between two positions. By injecting the relative distance dynamically into the attention score, the query vector can easily distinguish the representations of x_{\tau,j} and x_{\tau+1,j} from their different distances, making the state reuse mechanism feasible. Meanwhile, we won't lose any temporal information, as the absolute position can be recovered recursively from relative distances.

Previously, the idea of relative positional encodings has been explored in the context of machine translation (Shaw et al., 2018) and music generation (Huang et al., 2018). Here, we offer a different derivation, arriving at a new form of relative positional encodings, which not only has a one-to-one correspondence to its absolute counterpart but also enjoys much better generalization empirically (see Section 4). Firstly, in the standard Transformer (Vaswani et al., 2017), the attention score between query q_i and key vector k_j within the same segment can be decomposed as

A^{abs}_{i,j} = E_{x_i}^\top W_q^\top W_k E_{x_j}   (a)
             + E_{x_i}^\top W_q^\top W_k U_j       (b)
             + U_i^\top W_q^\top W_k E_{x_j}       (c)
             + U_i^\top W_q^\top W_k U_j           (d).

Following the idea of only relying on relative positional information, we propose to reparameterize the four terms as follows:

A^{rel}_{i,j} = E_{x_i}^\top W_q^\top W_{k,E} E_{x_j}   (a)
             + E_{x_i}^\top W_q^\top W_{k,R} R_{i-j}   (b)
             + u^\top W_{k,E} E_{x_j}                  (c)
             + v^\top W_{k,R} R_{i-j}                  (d).

- The first change we make is to replace all appearances of the absolute positional embedding U_j for computing key vectors in terms (b) and (d) with its relative counterpart R_{i-j}. This essentially reflects the prior that only the relative distance matters for where to attend. Note that R is a sinusoid encoding matrix (Vaswani et al., 2017) without learnable parameters.
- Secondly, we introduce a trainable parameter u \in \mathbb{R}^d to replace the query U_i^\top W_q^\top in term (c). In this case, since the query vector is the same for all query positions, it suggests that the attentive bias towards different words should remain the same regardless of the query position. With a similar reasoning, a trainable parameter v \in \mathbb{R}^d is added to substitute U_i^\top W_q^\top in term (d).
- Finally, we deliberately separate the two weight matrices W_{k,E} and W_{k,R} for producing the content-based key vectors and location-based key vectors respectively.

Under the new parameterization, each term has an intuitive meaning: term (a) represents content-based addressing, term (b) captures a content-dependent positional bias, term (c) governs a global content bias, and term (d) encodes a global positional bias.

In comparison, the formulation in Shaw et al. (2018) only has terms (a) and (b), dropping the two bias terms (c) and (d). Moreover, Shaw et al. (2018) merge the multiplication W_k R into a single trainable matrix \hat{R}, which abandons the inductive bias built into the original sinusoid positional encoding (Vaswani et al., 2017). In contrast, our relative positional embedding R adapts the sinusoid formulation. As a benefit of the inductive bias, a model trained on a memory of some certain length can automatically generalize to a memory several times longer during evaluation.
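The following PyTorch-style sketch spells out A^rel as written above, using a sinusoid table for R and an explicit R_{i−j} per query–key pair; the names, shapes, row-vector convention, and single-head simplification are ours, meant to mirror the formula rather than the released multi-head implementation.

```python
import torch

def sinusoid_encoding(positions, d):
    """Sinusoid encoding (Vaswani et al., 2017) evaluated at the given positions (d must be even)."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
    angles = positions.float()[:, None] * inv_freq[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)          # [len(positions), d]

def rel_attention_scores(E_q, E_k, W_q, W_kE, W_kR, u, v, mlen):
    """A^rel_{i,j} = (a) + (b) + (c) + (d); E_q: [qlen, d] current-segment inputs,
    E_k: [klen, d] inputs for memory + current segment, klen = mlen + qlen."""
    qlen, klen, d = E_q.size(0), E_k.size(0), E_q.size(1)
    q = E_q @ W_q                                                   # query vectors
    k_content = E_k @ W_kE                                          # content-based keys
    dist = (torch.arange(mlen, mlen + qlen)[:, None]
            - torch.arange(klen)[None, :])                          # relative distances i - j
    R = sinusoid_encoding(dist.reshape(-1), d)                      # R_{i-j} for every (i, j) pair
    k_position = (R @ W_kR).view(qlen, klen, d)                     # location-based keys
    terms_a_c = (q + u) @ k_content.t()                             # (a) + (c)
    terms_b_d = ((q + v)[:, None, :] * k_position).sum(-1)          # (b) + (d)
    return terms_a_c + terms_b_d                                    # [qlen, klen] attention scores

# toy usage with random parameters
d, qlen, mlen = 8, 4, 4
W = [torch.randn(d, d) for _ in range(3)]
scores = rel_attention_scores(torch.randn(qlen, d), torch.randn(qlen + mlen, d),
                              W[0], W[1], W[2], torch.randn(d), torch.randn(d), mlen)
```

Scores for future positions (negative distances) would subsequently be removed by the causal mask, and in practice the scores are scaled and passed through a masked Softmax as in the standard Transformer.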
Equipping the recurrence mechanism with our proposed relative positional embedding, we finally arrive at the Transformer-XL architecture. For completeness, we summarize the computational procedure for an N-layer Transformer-XL with a single attention head here. For n = 1, ..., N:

\tilde{h}_\tau^{n-1} = [\mathrm{SG}(m_\tau^{n-1}) \circ h_\tau^{n-1}],
q_\tau^n, k_\tau^n, v_\tau^n = h_\tau^{n-1} {W_q^n}^\top, \tilde{h}_\tau^{n-1} {W_{k,E}^n}^\top, \tilde{h}_\tau^{n-1} {W_v^n}^\top,
A^n_{\tau,i,j} = {q^n_{\tau,i}}^\top k^n_{\tau,j} + {q^n_{\tau,i}}^\top W^n_{k,R} R_{i-j} + u^\top k^n_{\tau,j} + v^\top W^n_{k,R} R_{i-j},
a^n_\tau = \text{Masked-Softmax}(A^n_\tau) v^n_\tau,
o^n_\tau = \text{LayerNorm}(\text{Linear}(a^n_\tau) + h^{n-1}_\tau),
h^n_\tau = \text{Positionwise-Feed-Forward}(o^n_\tau),

with h_\tau^0 := E_{s_\tau} defined as the word embedding sequence. In addition, it is worth mentioning that a naive way to compute A requires computing W^n_{k,R} R_{i-j} for all pairs (i, j), whose cost is quadratic w.r.t. the sequence length. However, noticing that the value of i - j only ranges from zero to the sequence length, we show a simple computation procedure in Appendix B, which reduces the cost to be linear w.r.t. the sequence length.
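To illustrate the cost-saving observation (this is our own sketch of the idea, not the exact procedure of Appendix B): the projection W_{k,R} R is computed once per distinct distance, i.e., a number of times linear in the sequence length, and the result is then aligned to the (i, j) grid by indexing rather than recomputed per pair.

```python
import torch

def sinusoid_encoding(positions, d):
    """Sinusoid encoding (Vaswani et al., 2017) at the given positions (d must be even)."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
    angles = positions.float()[:, None] * inv_freq[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

def position_terms(q, v, W_kR, mlen):
    """Terms (b) + (d) of A^rel, with one projection per distinct relative distance."""
    qlen, d = q.shape
    klen = qlen + mlen
    distances = torch.arange(klen - 1, -1, -1)        # every distance that can occur: klen-1, ..., 0
    pos_keys = sinusoid_encoding(distances, d) @ W_kR # W_{k,R} R_dist, computed once per distance
    by_distance = (q + v) @ pos_keys.t()              # [qlen, klen]; column m holds distance klen-1-m
    # Entry (i, j) of the attention matrix needs distance (i + mlen - j):
    dist = (torch.arange(qlen)[:, None] + mlen) - torch.arange(klen)[None, :]
    cols = (klen - 1) - dist.clamp(min=0)             # negative (future) distances get masked later
    return by_distance.gather(1, cols)                # [qlen, klen]

# toy usage
d, qlen, mlen = 8, 4, 4
out = position_terms(torch.randn(qlen, d), torch.randn(d), torch.randn(d, d), mlen)
```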

4 Experiments

4.1 Main Results

[Table 1: Comparison with state-of-the-art results on WikiText-103 (PPL). Baevski and Auli (2018) - Adaptive Input: 247M params, 20.5; Ours - Transformer-XL Large: 257M params, 18.3.]

[Table 2: Comparison with state-of-the-art results on enwik8 (bpc). Al-Rfou et al. (2018) - 64L Transformer: 235M params, 1.06; Ours - 18L Transformer-XL: 88M params, 1.03; Ours - 24L Transformer-XL: 277M params, 0.99.]

[Table 3: Comparison with state-of-the-art results on text8 (bpc). Al-Rfou et al. (2018) - 64L Transformer: 235M params, 1.13; Ours - 24L Transformer-XL: 277M params, 1.08.]

[Table 4: Comparison with state-of-the-art results on One Billion Word (PPL). Baevski and Auli (2018) - Adaptive Input: 0.46B params, 24.1 and 1.0B params, 23.7; Ours - Transformer-XL Base: 0.46B params, 23.5; Ours - Transformer-XL Large: 0.8B params, 21.8.]

We apply Transformer-XL to a variety of datasets on both word-level and character-level language modeling to have a comparison with state-of-the-art systems, including WikiText-103 (Merity et al., 2016), enwik8 (LLC, 2009), text8 (LLC, 2009), One Billion Word (Chelba et al., 2013), and Penn Treebank (Mikolov and Zweig, 2012).

WikiText-103 is the largest available word-level language modeling benchmark with long-term dependency. It contains 103M training tokens from 28K articles, with an average length of 3.6K tokens per article, which allows testing the ability of long-term dependency modeling. We set the attention length to 384 during training and 1600 during evaluation. We adopted adaptive softmax and input representations (Baevski and Auli, 2018; Grave et al., 2016a). As shown in Table 1, Transformer-XL reduces the previous state-of-the-art (SoTA) perplexity from 20.5 to 18.3, which demonstrates the superiority of the Transformer-XL architecture.

The dataset enwik8 contains 100M bytes of unprocessed Wikipedia text. We compare our architecture with the previous results in Table 2. Under the model size constraint, the 12-layer Transformer-XL achieves a new SoTA result, outperforming the 12-layer vanilla Transformer from Al-Rfou et al. (2018) by 0.05, while both Transformer variants have a large margin over conventional RNN-based models. Notably, our 12-layer architecture achieves the same result as the 64-layer network from Al-Rfou et al. (2018), using only 17% of the parameter budget. In order to see whether better performances can be obtained by increasing the model size, we train 18-layer and 24-layer Transformer-XLs with increased model sizes.
With the attention length 784 during training and 3,800 during evaluation, we obtained a new SoTA result and our method is the first to break through 1.0 on widely-studied character-level benchmarks. Different from Al-Rfou et al. (2018), Transformer-XL does not need any auxiliary losses, and thus all benefits are credited to a better architecture.

Similar to but different from enwik8, text8 contains 100M processed Wikipedia characters created by lowercasing the text and removing any character other than the 26 letters a through z, and space. Due to the similarity, we simply adapt the best model and the same hyper-parameters on enwik8 to text8 without further tuning. The comparison with previous methods is summarized in Table 3. Again, Transformer-XL achieves the new SoTA result with a clear margin.

One Billion Word does not preserve any long-term dependency because sentences have been shuffled. Consequently, this dataset mainly tests the ability of modeling only short-term dependency. The comparison between Transformer-XL and the other methods is shown in Table 4. Although Transformer-XL is mainly designed to better capture longer-term dependency, it dramatically improves the single-model SoTA from 23.7 to 21.8. Specifically, Transformer-XL significantly outperforms a contemporary method using vanilla Transformers (Baevski and Auli, 2018), suggesting the advantage of Transformer-XL is generalizable to modeling short sequences.

[Table 5: Comparison with state-of-the-art results on Penn Treebank (PPL). Ours - Transformer-XL: 24M params, 54.52. † indicates using two-step finetuning.]

We also report the results on word-level Penn Treebank in Table 5. Similar to AWD-LSTM (Merity et al., 2017), we apply variational dropout and weight average to Transformer-XL. With proper regularization, Transformer-XL achieves a new SoTA result among models without two-step finetuning. Penn Treebank has only 1M training tokens, which implies that Transformer-XL also generalizes well even on small datasets.

4.2 Ablation Study

We conduct two sets of ablation studies to examine the effects of the two proposed techniques used in Transformer-XL: the recurrence mechanism and the new positional encoding scheme.

The first study is performed on WikiText-103, which requires modeling long-term dependency. The results are reported in Table 6. Among the compared encoding schemes, Shaw et al. (2018) is relative, while Vaswani et al. (2017) and Al-Rfou et al. (2018) are absolute. "Full" and "half" losses refer to applying a cross entropy loss to all or the recent half positions in the segment. We found that absolute encodings only work well with half losses because half losses exclude positions with very short attention lengths during training for better generalization. Table 6 shows that both the recurrence mechanism and our encoding scheme are necessary to achieve the best performance, as well as generalizing to longer attention sequences during evaluation time. Although the backpropagation length during training is only 128, with the two techniques the attention length can be increased to 640 at test time. In the standard setting with 151M parameters, the perplexity decreases as the attention length increases.

Since the recurrence mechanism costs additional memory, we also compare Transformer-XL with baselines under the same GPU memory constraints. As shown in Table 10 in Appendix A, despite using a shorter backpropagation length, Transformer-XL remains superior to the baselines.

The second study targets isolating the effects of resolving the context fragmentation problem from the benefit of capturing longer context length. In order to achieve this goal, we deliberately choose a dataset that does not require long-term dependency, so that any improvement from establishing the recurrence can be attributed to solving the context fragmentation.
Specifically, we perform this controlled experiment on the One Billion Word dataset, which can only benefit from removing the context fragmentation. We train a 20-layer Transformer-XL with 0.3B parameters for 400K steps. As shown in Table 7, using segment-level recurrence substantially improves performance even when long-term dependency is not needed, which is consistent with our previous discussion that the recurrence mechanism resolves the context fragmentation problem. Moreover, our relative positional encodings are also superior to Shaw et al. (2018) on short sequences.

4.3 Relative Effective Context Length

Khandelwal et al. (2018) proposed a method to evaluate the Effective Context Length (ECL) of a sequence model. ECL is the longest length to which increasing the context span would lead to a gain more than a threshold. However, ECL ignores the fact that it is harder to get improvement when a model already achieves a lower perplexity.

[Table 6: Ablation study on WikiText-103, comparing variants with and without recurrence, different encoding schemes (ours, Shaw et al. (2018), Vaswani et al. (2017), Al-Rfou et al. (2018)), and full vs. half losses, reporting initial and best PPL together with the attention length used at evaluation.]
