Summary of “Perceiver: General Perception with Iterative Attention”

Title: Summary of “Perceiver: General Perception with Iterative Attention”
Author: Han Liu
Course: Advanced Topics in Continual / Organic Machine Learning
Institution: Karlsruher Institut für Technologie

Description

The Perceiver model consists of two main building blocks:

1. Cross-attention fuses the byte array with the previous latent array, and its output is the new latent array. The module introduces asymmetry: the K and V vectors come from the byte array, while the Q vectors come from the latent array. With byte-array length M and latent length N, the complexity of the attention mechanism is O(MN), which makes it possible for the model to process data directly at its original size.

2. The latent transformer maps a latent array to a latent array, projecting it from the previous feature space into another one.

The byte-array length M is the length of the input data (an image, a speech signal, etc.), which is usually very large. The latent-array length N is a self-defined hyperparameter that must be much smaller than M to achieve dimensionality reduction. To summarize, the latent array provides the Q vectors, which can be seen as queries, and the K and V vectors extracted from the byte array are queried repeatedly through iteration. Cross-attention reduces the very large original input to an acceptable dimension; the latent transformer then lets the elements of the latent array interact with one another continuously. With these two structures, the information of the byte array can eventually be represented in a much smaller latent array, at a complexity of O(MN) + O(N²) per layer, as the sketch below illustrates.
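
A minimal sketch of this layer structure, assuming PyTorch; the widths, head counts, and iteration count here are illustrative, and details of the paper's implementation (layer norms, MLPs, weight-sharing scheme) are omitted:

```python
import torch
import torch.nn as nn

D = 256                       # channel width (illustrative)
M, N = 50_176, 512            # byte-array length (e.g. 224*224 pixels) and latent length, N << M

cross_attn  = nn.MultiheadAttention(embed_dim=D, num_heads=1, batch_first=True)
latent_attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)

byte_array = torch.randn(1, M, D)   # raw input, e.g. a flattened image
latent     = torch.randn(1, N, D)   # latent array, a hyperparameter of the model

for _ in range(4):                  # iterative mining: the byte array is queried repeatedly
    # Asymmetric cross-attention: Q from the latent array, K and V from the
    # byte array; the attention matrix is N x M, so the cost is O(MN).
    latent, _ = cross_attn(latent, byte_array, byte_array)
    # Latent transformer: the latents interact with one another at O(N^2) cost.
    latent, _ = latent_attn(latent, latent, latent)
```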

In general, Perceiver stacks multiple cross-attention and latent-transformer modules to mine the input information continuously and iteratively. The paper also uses Fourier feature encodings to maintain positional information (see the sketch after the list below). The achievements of this paper mainly include three points:

1. It enables deeper transformer stacks; 48 layers are stacked in the experiments.

2. It can process data of multiple modalities without changing the model structure; the experiments are run on image, audio, video, and point-cloud data.

3. It introduces an asymmetric attention mechanism. The fusion of input data and latent array allows the model to process inputs of very large size, even operating directly on raw images of about 50,000 pixels without convolution.
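
As an illustration of the Fourier positional encoding mentioned above, the sketch below follows the common [x, sin(f·π·x), cos(f·π·x)] construction for a single coordinate axis; the band count and maximum frequency are illustrative assumptions, not the paper's exact values:

```python
import torch

def fourier_features(x, num_bands=6, max_freq=10.0):
    # x: coordinates scaled to [-1, 1], shape (..., 1)
    freqs = torch.linspace(1.0, max_freq / 2, num_bands)   # frequency bands up to the Nyquist rate
    angles = x * freqs * torch.pi                          # shape (..., num_bands)
    # concatenate the raw coordinate with its sine/cosine features
    return torch.cat([x, torch.sin(angles), torch.cos(angles)], dim=-1)

pos = torch.linspace(-1.0, 1.0, 224).unsqueeze(-1)   # pixel positions along one axis
enc = fourier_features(pos)                          # shape (224, 2 * num_bands + 1)
```

These features are concatenated to each input element before cross-attention, which is how the otherwise permutation-invariant attention retains positional information.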

