Turning AI Into Better Thinkers With Pointer-Based Memory

1 Apr 2025

Authors:

(1) Hung Le, Applied AI Institute, Deakin University, Geelong, Australia;

(2) Dung Nguyen, Applied AI Institute, Deakin University, Geelong, Australia;

(3) Kien Do, Applied AI Institute, Deakin University, Geelong, Australia;

(4) Svetha Venkatesh, Applied AI Institute, Deakin University, Geelong, Australia;

(5) Truyen Tran, Applied AI Institute, Deakin University, Geelong, Australia.

Abstract & Introduction

Methods

Methods Part 2

Experimental Results

Experimental Results Part 2

Related Works, Discussion, & References

Appendix A, B, & C

Appendix D

There have been many attempts to augment neural networks with external memory (MANNs) to improve their symbol-processing ability. Pioneers such as the NTM [Graves et al., 2014] and DNC [Graves et al., 2016] propose computer-like memory read/write operations with content-based attention mechanisms and thus, in principle, can execute arbitrary symbolic rules. However, learning the hidden rules end-to-end from sequence data is extremely difficult. Therefore, MANNs, including Transformers [Vaswani et al., 2017], may fail miserably on out-of-distribution testbeds, especially length extrapolation [Delétang et al., 2022]. Recent LLMs are good at reasoning and generalization but bad at symbolic processing [Qian et al., 2023, Tang et al., 2023]. We use LLMs only to illustrate the difficulty of our tasks (Appendix D.6), not as baselines, because they are not on the same scale as our method.
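To make the content-based read/write mentioned above concrete, here is a minimal NumPy sketch of a content-based memory read in the spirit of NTM/DNC-style MANNs. The function name and the sharpening parameter `beta` are illustrative assumptions, not taken from those papers.

```python
# Minimal sketch (illustrative, not from any cited paper): content-based
# memory read as used in NTM/DNC-style MANNs.
import numpy as np

def read_memory(memory: np.ndarray, query: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Read from a memory matrix (slots x dim) by content-based attention.

    Each slot is scored by cosine similarity with the query, sharpened by
    `beta`, normalized with a softmax, and the read vector is the weighted sum.
    """
    # Cosine similarity between the query and every memory slot.
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(query) + 1e-8
    scores = memory @ query / norms
    # Sharpen and normalize into attention weights over slots.
    weights = np.exp(beta * scores)
    weights /= weights.sum()
    # The read vector is a convex combination of memory slots.
    return weights @ memory

memory = np.random.randn(8, 16)   # 8 slots, 16-dim contents
query = np.random.randn(16)
read_vector = read_memory(memory, query)
```

Because the attention is driven purely by content similarity, the same mechanism must also discover any addressing rule hidden in the data, which is part of why end-to-end learning of symbolic procedures is hard.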

Many recent works advocate specialized memory architectures such as stacks [Grefenstette et al., 2015, Hao et al., 2018, Suzgun et al., 2019], key-value memory [Webb et al., 2020, Le et al., 2020a], and improved attention mechanisms [Kurach et al., 2015, Russin et al., 2019, Dubois et al., 2020]. These methods employ different inductive biases in designing the memory and attention, yet none follows the two principles advocated in this paper. Although they may work remarkably well on certain synthetic tasks, they have not been examined across diverse benchmarks or shown to be compatible with different sequential backbones. Other, orthogonal approaches focus on model initialization [Zhang et al., 2019], data augmentation [Andreas, 2020], or training details [Csordás et al., 2021]. Beyond differentiable models, there has been major progress in compositional rule learning that leverages neuro-symbolic architectures [Nye et al., 2020, Shaw et al., 2021, Chen et al., 2020] or reinforcement learning [Liu et al., 2020]. We do not compare our model with these task-specific methods, as our focus is on improving the systematic generalization of fundamental differentiable models.
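As a point of reference for the key-value family mentioned above, the following sketch shows a soft key-value lookup in which the keys are computed from the stored data itself. This is an illustrative assumption rather than the exact design of any cited model, and it sets up the contrast with data-independent address keys drawn in the next paragraph.

```python
# Minimal sketch (illustrative): a key-value memory read whose keys are
# derived from the stored data, so they change with the input distribution.
import numpy as np

def kv_read(keys: np.ndarray, values: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Soft key-value lookup: dot-product scores over keys, softmax, weighted value sum."""
    scores = keys @ query                      # (slots,)
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                    # (value_dim,)

# Keys computed from the values via a projection (a hypothetical choice here),
# so the "addresses" are entangled with the data they index.
values = np.random.randn(10, 32)
proj = np.random.randn(32, 16)
keys = values @ proj
read = kv_read(keys, values, query=np.random.randn(16))
```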

Our approach is mainly related to key-value memory because the address bank can be viewed as the keys and the data memory as the values. However, the keys in other works are either learned through backpropagation [Le et al., 2020a] or computed from the input data [Webb et al., 2020]. In contrast, our "keys" are generated as fixed numbers (physical memory addresses; Principle I in § 1), which are completely separated from the data and extendable to longer sequences. We argue that using addresses as keys is critical to symbol processing because it explicitly allows pointer assignment, dereferencing, and arithmetic. A related generalization-enabling scheme is to design positional encodings of tokens in a sequence [Vaswani et al., 2017, Dai et al., 2019, Li and McClelland, 2022]. Unlike these approaches, our physical addresses are detached from the data, which supports transforming pointers across time steps and isolating pointer manipulation from the input.
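The following sketch illustrates the address/data separation argued for here: fixed address codes act as keys, a pointer is an attention distribution over those addresses, dereferencing is a read from the data memory, and pointer arithmetic is a shift of that attention. The sinusoidal address encoding and the specific operations below are assumptions for illustration only; the paper's actual parameterization is given in the Methods sections.

```python
# Conceptual sketch of address-based pointers (assumed encoding, not the
# paper's exact formulation).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def make_address_bank(num_slots: int, dim: int) -> np.ndarray:
    """Fixed, data-independent address codes; longer sequences just need
    more addresses generated by the same rule."""
    pos = np.arange(num_slots)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(dim)[None, :] / dim)
    return np.sin(pos * freq)

num_slots, addr_dim, data_dim = 12, 16, 32
address_bank = make_address_bank(num_slots, addr_dim)    # "keys": physical addresses
data_memory = np.random.randn(num_slots, data_dim)       # "values": input-dependent data

# A pointer is an attention distribution over the address bank.
pointer_query = address_bank[3] + 0.1 * np.random.randn(addr_dim)
pointer = softmax(address_bank @ pointer_query)

# Dereference: read the data slot(s) the pointer refers to.
content = pointer @ data_memory

# Pointer arithmetic (e.g., "next element"): shift the attention by one slot
# without ever touching the data memory.
next_pointer = np.roll(pointer, 1)
next_content = next_pointer @ data_memory
```

Because the pointer lives entirely in address space, manipulating it never depends on the content stored in the data memory, which is the isolation property the paragraph above emphasizes.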

5 Discussion

We introduce a neural memory model called PANM that manipulates pointers to learn symbol-processing rules for better length extrapolation. PANM isolates symbols from data and uses an address bank to enable data-isolated pointer manipulation through address attention. PANM consistently outperforms strong baselines on tasks such as algorithm mining, compositional learning, mathematical reasoning, context-free grammar recognition, and practical NLP tasks, even when the test sequences are much longer than the training sequences.

Reproducibility. In the Appendix, we include detailed model descriptions, algorithms, implementation details, and hyperparameters needed to replicate our experimental results. Source code will be released when the paper is published.

Impact Statement. In this work, we used publicly available datasets for our experiments; we did not collect human or animal data during this study. Our work aims to improve the generalization of sequential models. This aim is benign, and we do not foresee immediate harmful consequences. However, we are aware of potential problems if our method is used to augment language models that generate hallucinated or harmful content. This issue is typical of plug-and-play modules like PANM, and we will do our best to prevent it from our end.

References

Jacob Andreas. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566, 2020.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. Systematic generalization: What is required and can it be learned? In International Conference on Learning Representations, 2018.

Xinyun Chen, Chen Liang, Adams Wei Yu, Dawn Song, and Denny Zhou. Compositional generalization via neural-symbolic stack machines. Advances in Neural Information Processing Systems, 33:1690–1701, 2020.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

Róbert Csordás, Kazuki Irie, and Jürgen Schmidhuber. The devil is in the detail: Simple tricks improve systematic generalization of transformers. arXiv preprint arXiv:2108.12284, 2021.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2018.

Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Marcus Hutter, Shane Legg, and Pedro A Ortega. Neural networks and the chomsky hierarchy. arXiv preprint arXiv:2207.02098, 2022.

Yann Dubois, Gautier Dagan, Dieuwke Hupkes, and Elia Bruni. Location attention for extrapolation to longer sequences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 403–413, 2020.

Aaron Eisermann, Jae Hee Lee, Cornelius Weber, and Stefan Wermter. Generalization in multimodal language learning from simulation. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.

Tong Gao, Qi Huang, and Raymond Mooney. Systematic generalization on gscan with language conditioned embedding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 491–503, 2020.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. Advances in neural information processing systems, 28, 2015.

Yiding Hao, William Merrill, Dana Angluin, Robert Frank, Noah Amsel, Andrew Benz, and Simon Mendelsohn. Context-free transductions with neural stacks. EMNLP 2018, page 306, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger P Levy. A systematic assessment of syntactic generalization in neural language models. arXiv preprint arXiv:2005.03692, 2020.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.

Asjad Khan, Hung Le, Kien Do, Truyen Tran, Aditya Ghose, Hoa Dam, and Renuka Sindhgatta. Deepprocess: supporting business process execution using a mann-based recommender system. In Service-Oriented Computing: 19th International Conference, ICSOC 2021, Virtual Event, November 22–25, 2021, Proceedings 19, pages 19–33. Springer, 2021.

Trenton E. Kriete, David C. Noelle, Jonathan D. Cohen, and Randall C. O’Reilly. Indirection and symbol-like processing in the prefrontal cortex and basal ganglia. Proceedings of the National Academy of Sciences, 110:16390 – 16395, 2013.

Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. arXiv preprint arXiv:1511.06392, 2015.

Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pages 2873–2882. PMLR, 2018.

Brenden M Lake. Compositional generalization through meta sequence-to-sequence learning. Advances in neural information processing systems, 32, 2019.

Hung Le and Svetha Venkatesh. Neurocoder: General-purpose computation using stored neural programs. In International Conference on Machine Learning, pages 12204–12221. PMLR, 2022.

Hung Le, Truyen Tran, and Svetha Venkatesh. Dual memory neural computer for asynchronous two-view sequential learning. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1637–1645, 2018.

Hung Le, Truyen Tran, and Svetha Venkatesh. Neural stored-program memory. In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum?id=rkxxA24FDr.

Hung Le, Truyen Tran, and Svetha Venkatesh. Self-attentive associative memory. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5682–5691, Virtual, 13–18 Jul 2020b. PMLR.

Yuxuan Li and James McClelland. Systematic generalization and emergent structures in transformers trained on structured tasks. In NeurIPS ’22 Workshop on All Things Attention: Bridging Different Perspectives on Attention, 2022. URL https://openreview.net/forum?id=BTNaKmYdQmE.

Qian Liu, Shengnan An, Jian-Guang Lou, Bei Chen, Zeqi Lin, Yan Gao, Bin Zhou, Nanning Zheng, and Dongmei Zhang. Compositional generalization by learning analytical expressions. Advances in Neural Information Processing Systems, 33:11416–11427, 2020.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

Benjamin Newman, John Hewitt, Percy Liang, and Christopher D. Manning. The eos decision and length extrapolation. In BlackBoxNLP@EMNLP, 2020. URL https://nlp.stanford.edu/pubs/newman2020extrapolation.pdf.

Maxwell Nye, Armando Solar-Lezama, Josh Tenenbaum, and Brenden M Lake. Learning compositional rules via neural program synthesis. Advances in Neural Information Processing Systems, 33:10832–10842, 2020.

Jing Qian, Hong Wang, Zekun Li, Shiyang Li, and Xifeng Yan. Limitations of language models in arithmetic and symbolic induction. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9285–9298. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.516. URL https://doi.org/10.18653/v1/2023.acl-long.516.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.

Jake Russin, Jason Jo, Randall C O’Reilly, and Yoshua Bengio. Compositional generalization in a deep seq2seq model by separating syntax and semantics. arXiv preprint arXiv:1904.09708, 2019.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2018.

Peter Shaw, Ming-Wei Chang, Panupong Pasupat, and Kristina Toutanova. Compositional generalization and natural language variation: Can a semantic parsing approach handle both? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 922–938, 2021.

Ray J Solomonoff. Algorithmic probability, heuristic programming and agi. In 3rd Conference on Artificial General Intelligence (AGI-2010), pages 57–63. Atlantis Press, 2010.

Mirac Suzgun, Sebastian Gehrmann, Yonatan Belinkov, and Stuart M Shieber. Memory augmented recurrent neural networks can learn generalized dyck languages. arXiv preprint arXiv:1911.03329, 2019.

Xiaojuan Tang, Zilong Zheng, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, and Muhan Zhang. Large language models are in-context semantic reasoners rather than symbolic reasoners. arXiv preprint arXiv:2305.14825, 2023.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. Advances in neural information processing systems, 28, 2015.

John Von Neumann. First draft of a report on the edvac. IEEE Annals of the History of Computing, 15(4):27–75, 1993.

Taylor Whittington Webb, Ishan Sinha, and Jonathan Cohen. Emergent symbols through binding in external memory. In International Conference on Learning Representations, 2020.

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart Van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.

Bo Wu, Haoyu Qin, Alireza Zareian, Carl Vondrick, and Shih-Fu Chang. Analogical reasoning for visually grounded language acquisition. arXiv preprint arXiv:2007.11668, 2020.

Greg Yang. Lie access neural turing machine. arXiv preprint arXiv:1602.08671, 2016.

Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. Learning the dyck language with attention-based seq2seq models. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 138–146, 2019.

Biao Zhang, Ivan Titov, and Rico Sennrich. Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 898–909, 2019.

This paper is available on arXiv under a CC BY 4.0 DEED license.