Tucano: Advancing Neural Text Generation for Portuguese
Main Author: Dr. KLUGE CORRÊA, Nicholas (Center for Science and Thought, Uni Bonn)
Co-Authors: SEN, Aniket (HPC/A-Lab, Uni Bonn); FALK, Sophia (Bonn Sustainable AI Lab, IWE, Uni Bonn); FATIMAH, Shiza (Institute for Computer Science, Uni Bonn)
Contact e-mail: nklugeco@uni-bonn.de

Abstract
Significant advances have been made in natural language processing in recent years. However, our current deep learning approach to language modeling requires substantial resources in terms of data and computation. One of the side effects of this data-hungry paradigm is the current schism between languages, separating those considered high-resource, where most development happens and resources are available, from the low-resource ones, which struggle to attain the same level of performance and autonomy. To counter this unbalanced progress, our submitted work, "Tucano: Advancing Neural Text Generation for Portuguese," introduces a new set of resources to stimulate the future development of neural text generation in Portuguese. We document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. On this corpus, we trained a series of decoder-transformers named Tucano. Our models perform on par with or better than other Portuguese and multilingual language models of similar size on several Portuguese benchmarks. The Tucano models were trained on Uni Bonn's Marvin cluster, leveraging its state-of-the-art hardware: we used approximately 5,900 GPU hours on the Highly Scalable GPU Partition. In terms of training speed, our runs achieved around 55% Model FLOPs Utilization (MFU), matching or surpassing similar works and highlighting Marvin's capacity to support cutting-edge research efficiently and competitively. Most notably, our evaluation reveals that model performance on many benchmarks currently used by the Portuguese NLP community has little to no correlation with the number of tokens ingested during training, highlighting the limitations of such evaluations when assessing Portuguese generative language models.
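For context, MFU compares achieved training throughput against the hardware's theoretical peak. Below is a minimal sketch of the standard estimate based on the common 6N FLOPs-per-token approximation; the model size, throughput, GPU count, and peak-FLOPs figures are illustrative placeholders, not values reported in the paper:

```python
# Minimal sketch of a Model FLOPs Utilization (MFU) estimate.
# Uses the common approximation that a forward + backward pass costs
# roughly 6 FLOPs per model parameter per token.

def mfu(n_params: float, tokens_per_second: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Fraction of theoretical peak throughput actually achieved."""
    achieved_flops = 6 * n_params * tokens_per_second
    peak_flops = n_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Hypothetical example: a 1.1B-parameter model training at 208,000 tokens/s
# on 8 A100-class GPUs (312 TFLOP/s peak in bf16). Numbers are made up.
print(f"MFU: {mfu(1.1e9, 208_000, 8, 312e12):.1%}")  # -> MFU: 55.0%
```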
All derivatives of our study are openly released at https://nkluge-correa.github.io/Tucano/. Our preprint is available on arXiv (2411.07854), and the paper is currently under review.
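To illustrate the kind of analysis behind the correlation finding mentioned in the abstract, here is a hypothetical sketch: score a series of intermediate checkpoints on a benchmark and test whether the scores track the number of tokens seen. The checkpoint counts, scores, and the use of Spearman rank correlation are assumptions for illustration, not the paper's exact procedure:

```python
# Hypothetical sketch: does a benchmark score grow with training scale?
# The checkpoint token counts and scores below are invented for illustration.
from scipy.stats import spearmanr

tokens_seen = [10e9, 50e9, 100e9, 150e9, 200e9]    # tokens ingested per checkpoint
benchmark_scores = [0.42, 0.41, 0.44, 0.42, 0.43]  # e.g., accuracy on one benchmark

rho, p_value = spearmanr(tokens_seen, benchmark_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A rho near zero suggests the benchmark does not reflect token ingestion,
# which is the pattern the abstract reports for many Portuguese benchmarks.
```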