Authors:
(1) Hung Le, Applied AI Institute, Deakin University, Geelong, Australia;
(2) Dung Nguyen, Applied AI Institute, Deakin University, Geelong, Australia;
(3) Kien Do, Applied AI Institute, Deakin University, Geelong, Australia;
(4) Svetha Venkatesh, Applied AI Institute, Deakin University, Geelong, Australia;
(5) Truyen Tran, Applied AI Institute, Deakin University, Geelong, Australia.
3.3 Compositional Learning
SCAN In this task, one needs to map an input sentence to an output sequence of commands [Lake and Baroni, 2018]. The sequences are compositional, consisting of reusable parts. For example, in one case, “jump twice” should be mapped to “JUMP JUMP” and in another, “walk twice” becomes “WALK WALK”. We focus on the “length split” datasets, where the training sequences are shorter than the test ones, with 11 length modes L = 22, 24, …, 40 [Newman et al., 2020]. We adopt the benchmark, training procedure, and baselines prepared by Csordás et al. [2021], which achieve strong results under standard s2s learning. Our aim here is not to break SOTA, which can be achieved by hybrid symbolic architectures [Chen et al., 2020, Shaw et al., 2021]. Instead, we focus on improving Transformer generalization on this task; hence the baselines are several Transformer (TRM) variants targeted at sequence extrapolation, including those using Relative Positional Encoding (RPE [Dai et al., 2019]) and the Universal Transformer (U. TRM [Dehghani et al., 2018]), an advanced Transformer variant that recurrently processes each token and can dynamically adjust the number of processing steps. Following Csordás et al. [2021], each baseline is trained 5 times for 50K steps, and the resulting model after training is used for evaluation (no validation). Here, we use a Transformer as the Encoder, the same as in TRM, and stack the Controller on top of another Transformer decoder (see details in Appendix D.3). Hence, the only difference lies in the decoding, where PANM leverages pointer manipulation.
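To make the task concrete, here is a minimal, hypothetical sketch of SCAN-style compositional commands and a "length split". The names (`PRIMITIVES`, `interpret`, `length_split`) are illustrative only and do not come from the benchmark code; the real dataset covers many more modifiers than `twice`/`thrice`.

```python
# Toy SCAN-style interpreter: reusable primitives composed by repetition words.
PRIMITIVES = {"jump": "JUMP", "walk": "WALK", "run": "RUN", "look": "LOOK"}

def interpret(sentence: str) -> str:
    """Map an input sentence to its command sequence, handling 'twice'/'thrice'."""
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        action = PRIMITIVES[tokens[i]]
        repeat = 1
        if i + 1 < len(tokens) and tokens[i + 1] in ("twice", "thrice"):
            repeat = 2 if tokens[i + 1] == "twice" else 3
            i += 1  # consume the modifier
        out.extend([action] * repeat)
        i += 1
    return " ".join(out)

def length_split(pairs, max_train_len):
    """Train on short output sequences; test only on strictly longer ones."""
    train = [p for p in pairs if len(p[1].split()) <= max_train_len]
    test = [p for p in pairs if len(p[1].split()) > max_train_len]
    return train, test
```

A length split of this kind forces the model to extrapolate: every test output is longer than anything seen during training.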
Table 2 shows that PANM outperforms the other baselines in the hardest settings, where the training length is up to 22, 24, and 25. In the 22 and 24 cases, general-purpose models like PANM cannot achieve perfect generalization because some testing compositions are entirely absent from the training set. In easier settings, PANM matches the perfect median accuracy of the sophisticated U. TRM + RPE even though it does not use RPE. Remarkably, despite sharing the same encoder, TRM performs much worse than PANM and even fails to learn in easy modes (33, 36, 40), indicating the importance of pointer handling in this testbed. One problem for the other baselines is the EOS decision (when to generate the ending token), which requires length tracking [Newman et al., 2020]. Because they lack content-free sequence iteration mechanisms, it is extremely hard for them to track the length without overfitting to the training data. PANM, in contrast, can hypothetically generate pointers incrementally, capture the difference between the last and the first pointers, i.e., the input length, and infer the output sequence length from that information.
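The content-free length-tracking argument can be illustrated with a small sketch. This is our own schematic, not PANM's actual mechanism: a pointer is incremented once per input token without ever reading token content, and the difference between the last and first pointer values yields the length used for the EOS decision.

```python
def pointer_length(input_tokens):
    """Track length by pointer arithmetic only; token values are never read."""
    first_ptr = 0
    ptr = first_ptr
    for _ in input_tokens:
        ptr += 1  # increment the pointer past each input slot
    return ptr - first_ptr  # last pointer minus first pointer = input length

def decode_with_eos(input_tokens, output_len_fn):
    """Emit placeholder outputs, stopping exactly at the inferred length."""
    n = output_len_fn(input_tokens)
    return ["OUT"] * n + ["<EOS>"]
```

Because no token content is involved, this strategy transfers unchanged to sequences longer than any seen in training, which is exactly what content-based models struggle with.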
Mathematical Problems We test our model on the mathematics dataset [Saxton et al., 2018], where the inputs/outputs are questions and answers about math and each token is a character. For example, “What is −5 − 110911?” → “−110916” (add or sub) and “What is the hundreds digit of 31253?” → “2” (place value). The task requires not only mathematical reasoning but also natural language understanding. We follow the training protocol of Csordás et al. [2021] to conduct experiments on 2 subsets: add or sub (a.s) and place value (p.v), and compare our method with Transformer-based baselines. Here, we focus on the extrapolation test set involving larger numbers, more numbers, more compositions, and thus longer input sequences than in training. We use TRM + RPE as the Encoder, and the Controller is added to a normal TRM decoder. As shown in Table 2, on place value PANM does not suffer the performance collapse of TRM + RPE (0% test accuracy, as reported in Csordás et al. [2021], even though it uses the same encoder). PANM achieves results similar to U. TRM + RPE on add or sub while outperforming it by 11% on place value. We also examine PANM with the original Transformer and report additional results in Appendix D.3.
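A small sketch of the two subtasks, for intuition. The question templates follow the examples quoted above; the function names and digit-index mapping are our own assumptions, not the dataset's generator code.

```python
def add_or_sub(a: int, b: int, op: str):
    """Build an (question, answer) pair for the add-or-sub subtask."""
    q = f"What is {a} {op} {b}?"
    ans = a + b if op == "+" else a - b
    return q, str(ans)

def place_value(n: int, place: str):
    """Build a (question, answer) pair for the place-value subtask."""
    q = f"What is the {place} digit of {n}?"
    idx = {"units": 1, "tens": 2, "hundreds": 3, "thousands": 4}[place]
    return q, str(abs(n) // 10 ** (idx - 1) % 10)
```

At the character level, "larger numbers" and "more numbers" directly translate into longer input sequences, which is why this benchmark probes length extrapolation.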
3.4 Other NLP Tasks
Question Answering Our objective is to explore PANM’s generalization beyond obviously compositional data by applying it to the more practical setting of question answering. For this purpose, we utilize two datasets, bAbI [Weston et al., 2015] and SQuAD 1.1 [Rajpurkar et al., 2016], where the input sequence is a context paragraph and a question, and the output is the answer. To add complexity to the task, we ensure that the test sequences are longer than the training ones by sorting the context paragraphs by length and splitting the sorted data at 0.8/0.2 and 0.5/0.5 ratios. Details of the data/task are in Appendix D.4. In bAbI, we configure PANM similarly to the setup described in § 3.3 with a Transformer backbone and test the models after 100-epoch training. The models predict the answer tokens given the context and question tokens. As shown in Table 3 and Appendix Fig. 5 (right), PANM helps the Transformer generalize better, consistently improving accuracy by around 6% and 5% on the 0.8/0.2 and 0.5/0.5 splits, respectively. Notably, PANM’s testing loss does not diverge as quickly as the Transformer’s, indicating PANM’s capability to reduce overfitting. In SQuAD, we use BERT as the backbone to predict the start and the end of the answer, as in Kenton and Toutanova [2019]. The PANM-assisted model outperforms the baselines by 1% and 2% exact-match accuracy on the two splits, respectively (Table 4). The improvement is significant given that BERT is a large foundation model already pretrained on big data and robust to novel test data.
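The length-based split described above can be sketched as follows. This is a schematic of the procedure as we describe it (sort by context length, cut at a ratio), with illustrative names; the exact preprocessing is in Appendix D.4.

```python
def length_sorted_split(examples, train_ratio):
    """Split (context, question, answer) triples so test contexts are longer.

    Sort by context length, then take the shortest `train_ratio` fraction
    for training and the remaining, longer contexts for testing.
    """
    ordered = sorted(examples, key=lambda ex: len(ex[0].split()))
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]
```

With `train_ratio=0.8` the test set holds the longest 20% of contexts; `train_ratio=0.5` makes the length gap between train and test even wider.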
Machine Translation Here, we verify PANM in machine translation and show that it can work with different numbers of Transformer layers. The results are presented in Fig. 3 (b), where we report model perplexity on the Multi30K (en-de) dataset. The 30K-sample dataset is sorted by input length and split into training and test sets such that testing sequences are longer, as in the QA tasks. The results demonstrate that PANM consistently improves the generalization performance of the Transformer across different split ratios and numbers of encoder/decoder layers.
This paper is available on arXiv under the CC BY 4.0 DEED license.