Tutorial: Fundamentals of Transformers: A Signal-processing View

Christos Thrampoulidis, Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi

Tutorial 23 Oct 2024

Part I: Motivation and Overview

I.1 The Transformer Revolution: Our tutorial begins with an in-depth account of the Transformer architecture and its wide range of applications. We place special emphasis on examples most relevant to the signal-processing audience, including speech analysis, time-series forecasting, image processing, and, most recently, wireless communication systems. We also review essential concepts in Transformer training, such as pre-training, fine-tuning, and prompt-tuning, and discuss Transformers' emerging abilities, such as in-context learning and reasoning.

I.2 A Signal-Processing-Friendly Introduction to the Attention Mechanism: We then give a comprehensive explanation of the Transformer block's structure. Our primary focus is the attention mechanism, the fundamental feature that distinguishes Transformers from conventional architectures such as fully connected, convolutional, and residual neural networks. To make the mechanism accessible to the signal-processing community, we introduce a simplified attention model that is closely connected to problems in sparse signal recovery and matrix factorization. Using this model as a basis, we pose critical questions about its ability to memorize long sequences, model long-range dependencies, and train effectively.

Part II: Efficient Inference and Adaptation: Quadratic attention bottleneck and Parameter-efficient tuning (PET)

II.1 Kernel viewpoint, low-rank/sparse approximation, FlashAttention (system level, implementation): Transformers struggle with long sequences because the cost of self-attention grows quadratically with sequence length. We review recently proposed efficient implementations that tackle this challenge, often achieving performance comparable or superior to vanilla Transformers.
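To make the quadratic bottleneck concrete, here is a minimal sketch of single-head softmax self-attention in plain Python. It is an illustration under simplifying assumptions, not the tutorial's own model: the query, key, and value maps are taken to be the identity (no learned weights), and the names `softmax` and `self_attention` are hypothetical helpers introduced here.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity query/key/value maps (an assumption).

    X is a list of n token vectors of dimension d.
    Returns (outputs, weights), where weights is the n-by-n attention matrix.
    """
    n, d = len(X), len(X[0])
    scale = 1.0 / math.sqrt(d)
    # n * n scaled dot products: this is the cost that is quadratic in n.
    scores = [[scale * sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
              for i in range(n)]
    weights = [softmax(row) for row in scores]
    # Each output token is a convex combination of all n input tokens.
    outputs = [[sum(weights[i][j] * X[j][k] for j in range(n)) for k in range(d)]
               for i in range(n)]
    return outputs, weights

# Toy sequence of n = 3 tokens in dimension d = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)
```

Doubling the number of tokens quadruples the size of `scores` and `weights`, which is the memory and compute cost that sparse, low-rank, and I/O-aware methods aim to reduce.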
First, we cover approaches that approximate quadratic-time attention using data-adaptive, sparse, or low-rank approximation schemes. Second, we review system-level improvements such as FlashAttention, where I/O-aware implementation can greatly accelerate inference. Finally, we highlight alternatives that replace self-attention with more efficient, problem-aware blocks while retaining performance.

II.2 PET: Prompt-tuning, LoRA adapter (low-rank projection): In traditional Transformer pipelines, models undergo general pre-training followed by task-specific fine-tuning, which yields a separate model copy for each task and increases computational and memory demands. Recent research focuses on parameter-efficient fine-tuning (PET), which updates only a small set of task-specific parameters, reducing memory usage and enabling mixed-batch inference. We highlight the key role of attention mechanisms in PET, discuss prompt-tuning, and explore LoRA, a PET method linked to low-rank factorization, which is widely studied in signal processing.

II.3 Communication and Robustness gains in Federated Learning: We discuss the use of large pretrained Transformers in mobile ML settings, with emphasis on federated learning. Our discussion emphasizes the ability of Transformers to adapt in a communication-efficient fashion via PET methods: (1) Using large models shrinks the accuracy gaps between alternative approaches and improves robustness to heterogeneity. Scaling allows clients to run more local SGD epochs, which can significantly reduce the number of communication rounds. (2) PET methods, by design, enable >100× less communication in bits while potentially boosting robustness to client heterogeneity and small sample size.

BREAK I

Part III: Approximation, Optimization, and Generalization Fundamentals

III.1 Approximation and Memorization Abilities: We discuss Transformers as sequence-to-sequence models with a fixed number of parameters, independent of sequence length.
Despite parameter sharing, Transformers exhibit universal approximation capabilities for sequence-to-sequence tasks. We cover key results on the approximation abilities of Transformer models, examining the impact of depth versus width. We also address their memorization capacity, emphasizing the trade-off between model size and the number of memorized sequence-to-sequence patterns. Additionally, we discuss the link between Transformers and associative memories, a topic of interest within the signal-processing community.

III.2 Optimization dynamics: Transformers as Support Vector Machines: In this section, we present an emerging theory that explains how the attention layer learns, during training, to select 'good' sequence elements (those most relevant to the prediction task) while suppressing 'bad' ones. This separation is formally framed as a convex optimization program, similar to classical support-vector machines (SVMs), but with a distinct operational interpretation that relates to the problems of low-rank and sparse signal recovery. This formulation should resonate with an audience with a background in signal processing, as it highlights an implicit preference within the Transformer to promote sparsity in the selection of sequence elements, a characteristic reminiscent of traditional sparsity-selection mechanisms such as the LASSO.

III.3 Generalization dynamics: Our discussion covers generalization aspects of both the foundational pretraining phase and the subsequent task-performance improvements achieved through prompt-tuning. To this end, we introduce statistical data models that extend traditional Gaussian mixture models and are specifically tailored to match the operational characteristics of the Transformer.
Our discussion includes an overview of, and a comprehensive list of references to, a set of tools drawn from high-dimensional statistics and from recently developed learning theories concerning the neural tangent kernel (NTK) and the feature-learning abilities of deep neural networks.

BREAK II

Part IV: Emerging abilities, in-context learning, reasoning

IV.1 Scaling laws and emerging abilities: We begin the last part of the tutorial by exploring scaling laws and their direct implications for the emerging abilities of Transformers. Specifically, we examine how these scaling laws quantitatively affect the performance, generalization, and computational characteristics of Transformers as they increase in size and complexity. Additionally, we draw connections between scaling laws and phase transitions, a concept familiar to the signal-processing audience, illustrating via examples from the literature how Transformers' behavior undergoes critical shifts as they traverse different scales.

IV.2 In-context learning (ICL): Transformers as optimization algorithms: We examine the capability of ICL, which enables Transformers to engage in reasoning, adaptation, and problem-solving across a wide array of machine learning tasks through straightforward language prompts that closely resemble human interactions. To illustrate this phenomenon, we provide concrete examples spanning both language-based tasks and mathematically structured, analytically tractable tasks. Furthermore, we present findings that shed light on one perspective of in-context learning: the Transformer's capacity to learn to implement gradient descent steps at each layer of its architecture. In doing so, we establish connections to deep-unfolding techniques, which have gained popularity in applications such as wireless communications and solving inverse problems.
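The gradient-descent view of ICL can be illustrated on in-context linear regression. The sketch below is an assumption-laden toy, not the tutorial's formulation: it uses softmax-free (linear) attention with keys x_i, values y_i, and the test input as the query, and checks numerically that this readout equals the prediction obtained after one gradient step from w = 0 on the squared loss. The function names and the step size `lr` are hypothetical.

```python
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def gd_step_prediction(xs, ys, x_query, lr):
    """Predict at x_query after one gradient step from w = 0
    on the loss 0.5 * sum_i (w . x_i - y_i)^2."""
    d = len(x_query)
    # The gradient at w = 0 is -sum_i y_i * x_i, so w_1 = lr * sum_i y_i * x_i.
    w = [lr * sum(y * x[k] for x, y in zip(xs, ys)) for k in range(d)]
    return dot(w, x_query)

def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention readout: keys x_i, values y_i, query x_query."""
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys))

# Synthetic in-context examples (x_i, y_i) from a random linear model.
random.seed(0)
d, n = 3, 8
w_true = [random.gauss(0, 1) for _ in range(d)]
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ys = [dot(w_true, x) for x in xs]
x_query = [random.gauss(0, 1) for _ in range(d)]

pred_gd = gd_step_prediction(xs, ys, x_query, lr=0.1)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=0.1)
```

The two predictions agree up to floating-point rounding because both compute lr * sum_i y_i (x_i . x_query). Stacking such layers then resembles unrolling several iterations of an optimizer, which is the link to deep unfolding.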
IV.3 Primer on Reasoning: The compositional nature of human language allows us to express fine-grained tasks and concepts. Recent innovations such as prompt-tuning, instruction-tuning, and various prompting algorithms are enabling the same for language models and catalyzing their ability to accomplish complex multi-step tasks such as mathematical reasoning or code generation. Here, we first introduce important prompting strategies that catalyze reasoning, such as chain-of-thought, tree-of-thought, and self-evaluation. We then demonstrate how these methods boost reasoning performance as well as the model’s ability to evaluate its own output, contributing to trustworthiness. Finally, building on the ICL discussion, we introduce mathematical formalisms that shed light on how reasoning can be framed as “acquiring useful problem solving skills” and “composing these skills to solve new problems”.

Conclusions, outlook, and open problems

We conclude the tutorial by going over a list of important open problems related to the fundamental understanding of Transformer models, while emphasizing how this research creates opportunities for enhancing architectures and improving algorithms and techniques. This will bring the audience to the forefront of fast-paced research in this area.
