Self-Attention through Kernel-Eigen Pair Sparse Variational Gaussian Processes

ICML 2024

Yingyi Chen*1   Qinghua Tao*1   Francesco Tonin2   Johan A.K. Suykens1

1ESAT-STADIUS, KU Leuven, Belgium      2LIONS, EPFL, Switzerland

*Equal contribution 

workflow.jpg

Illustration of canonical self-attention and our KEP-SVGP in one layer. (a) The attention kernel K_{\rm att} in canonical self-attention is induced by two different feature maps \phi_q,\phi_k related to queries and keys; hence K_{\rm att} is in essence asymmetric. (b) KEP-SVGP consists of one SVGP pair induced by the two sets of projection outputs based on \phi_q,\phi_k from KSVD w.r.t. K_{\rm att}, which fully characterizes the asymmetry of self-attention in the posterior. The posteriors are now approximated based on the inversion of a diagonal matrix \Lambda containing top s singular values, thereby of time complexity \mathcal{O}(s).

arXiv Paper Code Video Poster

Abstract


While the great capability of Transformers significantly boosts prediction accuracy, they can also yield overconfident predictions that call for calibrated uncertainty estimation, which is commonly tackled with Gaussian processes (GPs). Existing works apply GPs with symmetric kernels under variational inference to the attention kernel; however, they omit the fact that attention kernels are in essence asymmetric. Moreover, the complexity of deriving the GP posteriors remains high for large-scale data. In this work, we propose Kernel-Eigen Pair Sparse Variational Gaussian Processes (KEP-SVGP) for building uncertainty-aware self-attention, where the asymmetry of attention kernels is tackled by Kernel SVD (KSVD) and a reduced complexity is acquired. Through KEP-SVGP, i) the SVGP pair induced by the two sets of singular vectors from KSVD w.r.t. the attention kernel fully characterizes the asymmetry; ii) using only a small set of adjoint eigenfunctions from KSVD, the derivation of the SVGP posteriors can be based on the inversion of a diagonal matrix containing singular values, contributing to a reduction in time complexity; iii) an evidence lower bound is derived so that variational parameters and network weights can be optimized with it. Experiments verify the excellent performance and efficiency of our method on in-distribution, distribution-shift and out-of-distribution benchmarks.

Background


Gaussian Processes

A Gaussian Process (GP) represents a distribution, denoted by \mathcal{GP}, over real-valued functions f(\cdot) defined on an input domain \mathcal{X}\subseteq\mathbb{R}^d. We have the prior process as follows:

f(\cdot) \sim \mathcal{GP}\left(0, \kappa(\cdot,\cdot)\right), \quad \text{so that} \quad \boldsymbol{f} := f(X) \sim \mathcal{N}(\boldsymbol{0}, K_{XX}),

where X=\{\boldsymbol{x}_i\}_{i=1}^N, [K_{XX}]_{ij}=\kappa(\boldsymbol{x}_i,\boldsymbol{x}_j), and the covariance function of the GP is given by a symmetric positive-definite kernel function \kappa(\cdot,\cdot)\colon\mathcal{X}\times\mathcal{X}\mapsto\mathbb{R}. Given noisy observations \boldsymbol{y}=\boldsymbol{f}+\boldsymbol{\varepsilon} with \boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0},\sigma^2 I), the posterior process of f(\cdot) is

p\left(f(\boldsymbol{x}_*) \mid \boldsymbol{y}\right) = \mathcal{N}\left( \boldsymbol{k}_{*X}(K_{XX}+\sigma^2 I)^{-1}\boldsymbol{y}, \; \kappa(\boldsymbol{x}_*,\boldsymbol{x}_*) - \boldsymbol{k}_{*X}(K_{XX}+\sigma^2 I)^{-1}\boldsymbol{k}_{X*} \right),

which is of time complexity \mathcal{O}(N^3), due to the inversion of the N\times N kernel matrix.
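To make the complexity concrete, below is a minimal NumPy sketch of the exact GP posterior above, assuming an RBF kernel and toy data (not from the paper); the Cholesky factorization of the N\times N kernel matrix is the \mathcal{O}(N^3) bottleneck.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Symmetric positive-definite kernel: kappa(x, x') = exp(-||x - x'||^2 / (2 l^2)).
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_posterior(X, y, X_star, noise=1e-2):
    # Exact GP posterior; factorizing the N x N kernel matrix is the O(N^3) step.
    K_XX = rbf_kernel(X, X) + noise * np.eye(len(X))
    K_sX = rbf_kernel(X_star, X)
    K_ss = rbf_kernel(X_star, X_star)
    L = np.linalg.cholesky(K_XX)                          # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K_XX + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, K_sX.T)
    mean = K_sX @ alpha
    cov = K_ss - V.T @ V
    return mean, cov

# Toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = np.sin(X[:, 0])
mean, cov = gp_posterior(X, y, rng.standard_normal((5, 3)))
```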

Sparse Variational Gaussian Processes

Sparse variational Gaussian processes (SVGP) use m\ll N inducing points Z=\{\boldsymbol{z}_j\}_{j=1}^m, with inducing variables \boldsymbol{u}:=f(Z), to represent the whole dataset,

q\left(f(\boldsymbol{x}_*)\right) = \int p\left(f(\boldsymbol{x}_*) \mid \boldsymbol{u}\right) q(\boldsymbol{u}) \,\mathrm{d}\boldsymbol{u} = \mathcal{N}\left( \boldsymbol{k}_{*Z}K_{ZZ}^{-1}\boldsymbol{m}, \; \kappa(\boldsymbol{x}_*,\boldsymbol{x}_*) + \boldsymbol{k}_{*Z}K_{ZZ}^{-1}(S-K_{ZZ})K_{ZZ}^{-1}\boldsymbol{k}_{Z*} \right),

where the variational posterior is based on the conditional p\left(f(\boldsymbol{x}_*) \mid \boldsymbol{u}\right) with q(\boldsymbol{u})=\mathcal{N}(\boldsymbol{m}, S) being the variational distribution of the inducing variables. The posterior is now of time complexity \mathcal{O}(Nm^2+m^3), since only the m\times m matrix K_{ZZ} needs to be inverted.
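A corresponding sketch of the SVGP predictive posterior, assuming the variational parameters \boldsymbol{m}, S of q(\boldsymbol{u}) are already given (in practice they are learned by maximizing the ELBO); the kernel argument can be, e.g., the rbf_kernel from the sketch above. Only the m\times m matrix K_{ZZ} is inverted.

```python
import numpy as np

def svgp_posterior(X_star, Z, m_u, S_u, kernel, jitter=1e-6):
    # Variational posterior q(f(x_*)) = N(K_*Z K_ZZ^{-1} m,
    #   K_** + K_*Z K_ZZ^{-1} (S - K_ZZ) K_ZZ^{-1} K_Z*).
    K_ZZ = kernel(Z, Z) + jitter * np.eye(len(Z))
    K_sZ = kernel(X_star, Z)
    K_ss = kernel(X_star, X_star)
    A = K_sZ @ np.linalg.inv(K_ZZ)      # only an m x m inversion, m << N
    mean = A @ m_u
    cov = K_ss + A @ (S_u - K_ZZ) @ A.T
    return mean, cov
```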

SVGPs with Kernel-Eigen Features

We can further reduce the time complexity of the posterior process by choosing kernel-eigen features as the inducing variables. Define \phi_i(\cdot) as the i-th eigenfunction of the kernel \kappa(\cdot,\cdot) with eigenvalue \lambda_i, and take the projections u_i := \langle f, \phi_i \rangle, i=1,\ldots,m, as the inducing variables; then we have

\mathrm{cov}(\boldsymbol{u},\boldsymbol{u}) = \Lambda, \qquad \mathrm{cov}(\boldsymbol{f},\boldsymbol{u}) = H\Lambda, \qquad \text{with} \quad K_{XX}H = H\Lambda,

where \Lambda=\mathrm{diag}(\lambda_1,\ldots,\lambda_m) is the eigenvalue matrix, and the columns of H=[\boldsymbol{h}_1,\ldots,\boldsymbol{h}_m] are the eigenvectors. The posterior is now

q(\boldsymbol{f}) = \mathcal{N}\left( H\boldsymbol{m}, \; K_{XX} + H(S-\Lambda)H^\top \right),

of time complexity \mathcal{O}(m), since only the diagonal matrix \Lambda needs to be inverted.
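The sketch below illustrates this construction on a finite kernel matrix. For illustration the eigenpairs are obtained here by an explicit eigendecomposition of K_{XX}; in KEP-SVGP they will instead come from the KSVD of the attention kernel. The key point is that \mathrm{cov}(\boldsymbol{u},\boldsymbol{u})=\Lambda is diagonal, so no dense matrix inverse is needed.

```python
import numpy as np

def kernel_eigen_svgp_posterior(K_XX, m_u, S_u, m):
    # SVGP posterior with kernel-eigen inducing features: K_uu = Lambda is diagonal,
    # so "inverting" it is elementwise, O(m), instead of a dense O(m^3) solve.
    eigvals, eigvecs = np.linalg.eigh(K_XX)        # K_XX H = H Lambda
    idx = np.argsort(eigvals)[::-1][:m]
    Lam, H = eigvals[idx], eigvecs[:, idx]         # top-m eigenpairs
    mean = H @ m_u                                 # K_fu Lambda^{-1} m = H Lambda Lambda^{-1} m
    cov = K_XX + H @ (S_u - np.diag(Lam)) @ H.T
    return mean, cov
```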

Canonical Self-Attention Is Induced by an Asymmetric Kernel

Let \{\boldsymbol{x}_i\in\mathbb{R}^d\}_{i=1}^N be the input data sequence. In self-attention, the queries, keys and values are linear projections of the input sequence:

q(\boldsymbol{x}_i) = W_q \boldsymbol{x}_i,
    \quad
    k(\boldsymbol{x}_i) = W_k \boldsymbol{x}_i,
    \quad
    v(\boldsymbol{x}_i) = W_v \boldsymbol{x}_i.

Canonical self-attention applies the softmax activation to bring in non-linearity and positivity, yielding the attention weights:

\kappa_\text{att} (\boldsymbol{x}_i, \boldsymbol{x}_j)= \text{softmax} \left( \left< W_q \boldsymbol{x}_i, W_k \boldsymbol{x}_j\right> / \sqrt{d_k} \right), \quad i,j=1,\ldots,N,

where \kappa_\text{att}(\cdot,\cdot):\mathbb{R}^d\times\mathbb{R}^d\mapsto \mathbb{R} serves as the kernel function. Notice that in general \left< W_q \boldsymbol{x}_i, W_k \boldsymbol{x}_j\right> \neq \left< W_q \boldsymbol{x}_j, W_k \boldsymbol{x}_i\right>, leading to an asymmetric kernel with \kappa_\text{att}(\boldsymbol{x}_i, \boldsymbol{x}_j) \neq \kappa_\text{att}(\boldsymbol{x}_j, \boldsymbol{x}_i). Please refer to Primal-Attention (NeurIPS 2023) for more details.
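A toy NumPy check of this asymmetry, with random stand-in weights for W_q, W_k:

```python
import numpy as np

def softmax(Z, axis=-1):
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d, d_k = 6, 16, 8
X = rng.standard_normal((N, d))                     # input sequence
W_q, W_k = rng.standard_normal((d_k, d)), rng.standard_normal((d_k, d))

Q, K = X @ W_q.T, X @ W_k.T                         # queries and keys
K_att = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)    # attention kernel matrix

# phi_q and phi_k differ, so the attention kernel is asymmetric in general.
print(np.allclose(K_att, K_att.T))                  # typically False
```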

Self-Attention with Kernel SVD

Canonical self-attention can be written as the dual representation of a kernel SVD (KSVD) problem. Please refer to Primal-Attention (NeurIPS 2023) for more details.

Remark 3.3 (Primal-dual representations of KSVD in self-attention). In the KSVD formulation for the asymmetric kernel matrix in self-attention, with the KKT conditions, the projection scores can be represented either in the primal using the explicit feature maps or in the dual using the kernel function:

\text{Primal:}\quad e(\boldsymbol{x}) = W_e^\top \phi_q(\boldsymbol{x}), \quad r(\boldsymbol{x}) = W_r^\top \phi_k(\boldsymbol{x}); \qquad \text{Dual:}\quad e(\boldsymbol{x}_i) = \sum_{j=1}^N \kappa_\text{att}(\boldsymbol{x}_i,\boldsymbol{x}_j)\,\boldsymbol{h}_{r_j}, \quad r(\boldsymbol{x}_j) = \sum_{i=1}^N \kappa_\text{att}(\boldsymbol{x}_i,\boldsymbol{x}_j)\,\boldsymbol{h}_{e_i},

where H_e=[\boldsymbol{h}_{e_1},\ldots,\boldsymbol{h}_{e_N}]^\top\in\mathbb{R}^{N\times s}, H_r=[\boldsymbol{h}_{r_1},\ldots,\boldsymbol{h}_{r_N}]^\top\in\mathbb{R}^{N\times s} are the dual variables, whose columns are the left and right singular vectors of the attention matrix K_\text{att}\in\mathbb{R}^{N\times N}.
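The standalone sketch below illustrates the dual representation on a toy attention matrix. Here the dual variables are taken from an explicit SVD purely for illustration; in Primal-Attention and KEP-SVGP they are learned through the KSVD objective rather than computed by an SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
N, s = 6, 4
scores = rng.standard_normal((N, N))
K_att = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # toy asymmetric attention matrix

U, svals, Vt = np.linalg.svd(K_att)        # K_att = U diag(svals) V^T
H_e, H_r = U[:, :s], Vt[:s, :].T           # dual variables: left/right singular vectors
Lam = np.diag(svals[:s])                   # top-s singular values

E = K_att @ H_r                            # dual projection scores e(x_i), rows of K_att H_r
R = K_att.T @ H_e                          # dual projection scores r(x_j), rows of K_att^T H_e

print(np.allclose(E, H_e @ Lam), np.allclose(R, H_r @ Lam))      # both True
```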

Method


Pair of Adjoint Eigenfunctions for Self-Attention

Self-attention corresponds to a shifted eigenvalue problem w.r.t. the attention matrix K_\text{att} [Primal-Attention (NeurIPS 2023)]:

K_\text{att} H_r = H_e \Lambda, \qquad K_\text{att}^\top H_e = H_r \Lambda,

leading to the two eigendecompositions w.r.t. the symmetric kernels K_\text{att}K_\text{att}^\top, K_\text{att}^\top K_\text{att}:

K_\text{att}K_\text{att}^\top H_e = H_e \Lambda^2, \qquad K_\text{att}^\top K_\text{att} H_r = H_r \Lambda^2,

corresponding to two SVGPs with adjoint kernel-eigen features: one SVGP is built upon the symmetric kernel matrix K_\text{att}K_\text{att}^\top with kernel-eigen features given by (H_e, \Lambda^2), the other upon K_\text{att}^\top K_\text{att} with (H_r, \Lambda^2). Their posteriors only require the inversion of the diagonal matrix \Lambda containing the top-s singular values, hence of time complexity \mathcal{O}(s).
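A quick numerical check of the shifted eigenvalue problem and the two induced symmetric eigendecompositions, again on a toy attention matrix with the singular pairs obtained by an explicit SVD for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, s = 6, 4
scores = rng.standard_normal((N, N))
K_att = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)

U, svals, Vt = np.linalg.svd(K_att)
H_e, H_r, Lam = U[:, :s], Vt[:s, :].T, np.diag(svals[:s])

print(np.allclose(K_att @ H_r, H_e @ Lam))                  # K_att H_r = H_e Lambda
print(np.allclose(K_att.T @ H_e, H_r @ Lam))                # K_att^T H_e = H_r Lambda
print(np.allclose(K_att @ K_att.T @ H_e, H_e @ Lam @ Lam))  # eigenvalues Lambda^2
print(np.allclose(K_att.T @ K_att @ H_r, H_r @ Lam @ Lam))  # eigenvalues Lambda^2
```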

Kernel-Eigen Pair Sparse Variational Gaussian Processes

The outputs of the two SVGPs are obtained by Monte-Carlo sampling with the reparameterization trick:

F^e = \boldsymbol{\mu}_e + L_e\,\boldsymbol{\epsilon}_e, \qquad F^r = \boldsymbol{\mu}_r + L_r\,\boldsymbol{\epsilon}_r, \qquad \boldsymbol{\epsilon}_e,\boldsymbol{\epsilon}_r \sim \mathcal{N}(\boldsymbol{0}, I),

where L_e, L_r are the Cholesky factors of the posterior covariances of the two SVGPs, respectively. Finally, we merge the outputs of the two SVGPs either by the addition or the concatenation scheme:

F = F^e + F^r \quad \text{(addition)}, \qquad F = [F^e, F^r] \quad \text{(concatenation)}.

The merged F then serves as the final output of the uncertainty-aware self-attention layer.
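A sketch of the sampling and merging step, assuming the posterior means and covariances of the two SVGPs (placeholder values mu_e, Sigma_e, mu_r, Sigma_r below) have already been computed:

```python
import numpy as np

def sample_svgp_output(mean, cov, rng, jitter=1e-6):
    # Reparameterization trick: F = mu + L eps, with L the Cholesky factor of the
    # posterior covariance and eps ~ N(0, I).
    L = np.linalg.cholesky(cov + jitter * np.eye(len(cov)))
    eps = rng.standard_normal(mean.shape)
    return mean + L @ eps

rng = np.random.default_rng(1)
N, s = 6, 4
mu_e, mu_r = rng.standard_normal((N, s)), rng.standard_normal((N, s))
Sigma_e = Sigma_r = np.eye(N)              # placeholder posterior covariances over the sequence

F_e = sample_svgp_output(mu_e, Sigma_e, rng)
F_r = sample_svgp_output(mu_r, Sigma_r, rng)

F_add = F_e + F_r                          # addition scheme
F_cat = np.concatenate([F_e, F_r], -1)     # concatenation scheme
```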

Results


Please refer to our paper for more experiments.

Uncertainty Awareness on In-distribution Data

low-rank.jpg

Resources


arXiv

Paper

Code

Poster

BibTeX

If you find this work useful for your research, please consider citing:

@inproceedings{chen2024self,
  title={Self-Attention through Kernel-Eigen Pair Sparse Variational Gaussian Processes},
  author={Chen, Yingyi and Tao, Qinghua and Tonin, Francesco and Suykens, Johan A.K.},
  booktitle={International Conference on Machine Learning},
  year={2024}
}

Acknowledgements


This work is jointly supported by the European Research Council under the European Union’s Horizon 2020 research and innovation program/ERC Advanced Grant E-DUALITY (787960), iBOF project Tensor Tools for Taming the Curse (3E221427), Research Council KU Leuven: Optimization framework for deep kernel machines C14/18/068, KU Leuven Grant CoE PFV/10/002, The Research Foundation–Flanders (FWO) projects: GOA4917N (Deep Restricted kernel Machines: Methods and Foundations), Ph.D./Postdoctoral grant, the Flemish Government (AI Research Program), EU H2020 ICT-48 Network TAILOR (Foundations of Trustworthy AI-Integrating Reasoning, Learning and Optimization), Leuven.AI Institute.

© This webpage was in part inspired by this template.