Paper translation blog

Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." Advances in neural information processing systems 35 (2022): 23716-23736.

Flamingo: a Visual Language Model for Few-Shot Learning

DeepMind

Abstract

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs.

Thanks to this flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endowing them with in-context few-shot learning capabilities.

We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as:

Visual Question Answering (VQA): the model is prompted with a question that it has to answer,
Captioning: evaluates the ability to describe a scene or an event.

They also include close-ended tasks such as:

Multiple-choice VQA: the model has to pick the correct answer among several choices.

For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by being prompted with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.

Figure 1: Selected examples of inputs and outputs obtained from Flamingo-80B. Flamingo can rapidly adapt to various image/video understanding tasks with few-shot prompting alone (top). Out of the box, Flamingo is also capable of multi-image visual dialogue without any fine-tuning (bottom). More examples are given in Appendix C.

Figure 2: Flamingo results overview. Left: Our largest model, dubbed Flamingo, outperforms state-of-the-art fine-tuned models on 6 of the 16 tasks we consider, with no fine-tuning. For the 9 tasks with published few-shot results, Flamingo sets the new few-shot state of the art. Note that we omit RareAct, our 16th benchmark, as it is a zero-shot benchmark with no available fine-tuned results to compare to. Right: Flamingo performance improves with model size and with the number of few-shot examples.

1 Introduction

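One key aspect of intelligence is the ability to quickly learn a new task given only a short instruction [33, 70]. While initial progress has been made towards a similar capability in computer vision, the most widely used paradigm still consists of first pretraining a model on a large amount of supervised data and then fine-tuning it on the task of interest [66, 118, 143]. However, such fine-tuning requires many thousands of annotated data points, demands careful per-task hyperparameter tuning, and is resource intensive.
Recently, multimodal vision-language models trained with a contrastive objective have enabled zero-shot adaptation to novel tasks without fine-tuning [50, 85]. However, because these models simply provide a similarity score between a text and an image, they can only address limited use cases such as classification, where a finite set of outcomes (class labels) is provided beforehand. They lack the ability to generate language, which makes them unsuitable for open-ended tasks such as captioning or visual question answering. Other works have explored visually-conditioned language generation [17, 114, 119, 124, 132], but have not shown good performance in the few-shot, low-data setting.

To address this, we introduce Flamingo, a Visual Language Model (VLM) that sets a new few-shot state of the art on a wide range of open-ended vision-language tasks simply by being prompted with a few input/output examples, as illustrated in Figure 1. On 6 of the 16 tasks we consider, Flamingo also surpasses the fine-tuned state of the art, despite using far less task-specific training data (see Figure 2).
To achieve this, Flamingo takes inspiration from recent work on large language models (LMs) that are good few-shot learners [11, 18, 42, 86]. Such LMs can perform many tasks through their text interface alone: given a prompt made of a few examples and a query input, they generate a prediction for that query. We show that the same can be done for image and video understanding tasks (such as classification, captioning, or question answering), since these tasks can be cast as text prediction problems conditioned on visual input.
The difference from an LM is that Flamingo must be able to process a multimodal prompt in which text is interleaved with images and/or videos. Flamingo has this capability: it is a visually-conditioned autoregressive text generation model that takes as input a sequence of text tokens interleaved with images and/or videos and produces text as output.
Flamingo leverages two pretrained models:

a vision model that can "perceive" visual scenes,
a large language model that performs a basic form of reasoning.

Novel architecture components are inserted between the two so that they are connected while preserving the knowledge each has accumulated during pretraining.
Flamingo can also efficiently ingest high-resolution images or videos thanks to a Perceiver-based architecture [48]. This architecture produces a fixed number of visual tokens from a large set of visual input features, which lets the model accept image/video inputs of varying sizes.

Figure 3: Flamingo architecture overview. Flamingo is a family of Visual Language Models (VLMs) that take as input visual data interleaved with text and produce free-form text as output.

A crucial factor in the performance of large LMs is that they are trained on a vast amount of text data. This training endows them with general-purpose generation capabilities that yield strong performance when prompted with task examples alone. Similarly, we demonstrate empirically that the way Flamingo models are trained is crucial for their final performance. They are trained on a carefully chosen mixture of complementary large-scale multimodal data coming only from the web, without using any data annotated for machine learning purposes. After this training, a Flamingo model can be directly applied to vision tasks via simple few-shot prompting, without any task-specific tuning.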

Contributions

In summary, our contributions are the following:
(i) We introduce the Flamingo family of Visual Language Models (VLMs), which can perform various multimodal tasks (such as captioning, visual dialogue, or visual question-answering) from only a few input/output examples. Thanks to architectural innovations, Flamingo models can efficiently accept arbitrarily interleaved visual data and text as input and generate free-form text.
(ii) We quantitatively evaluate how Flamingo models can be adapted to various tasks via few-shot learning. In particular, we reserve a large set of held-out benchmarks that were never used to validate any design decision or hyperparameter of the approach, and use them to estimate unbiased few-shot performance.
(iii) Flamingo sets a new state of the art in few-shot learning on 16 multimodal language and image/video understanding tasks. On 6 of these tasks, Flamingo also outperforms the fine-tuned state of the art despite using only 32 task-specific examples, around 1000 times less task-specific training data than the existing state of the art. With a larger annotation budget, Flamingo can also be fine-tuned to set a new state of the art on five additional challenging benchmarks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes.

2 Approach

This section describes Flamingo, a Visual Language Model that accepts text interleaved with images/videos as input and outputs free-form text. The key architectural components shown in Figure 3 are designed to bridge pretrained vision and language models effectively.
First, the Perceiver Resampler (Section 2.1) receives spatio-temporal features from the Vision Encoder (obtained from either an image or a video) and outputs a fixed number of visual tokens.
Second, these visual tokens are used to condition the frozen Language Model (LM) through freshly initialised cross-attention layers inserted into the pretrained LM (Section 2.2). These cross-attention layers offer an expressive way for the LM to incorporate visual information when predicting the next token.
Flamingo models the likelihood of text $y$ conditioned on interleaved images and videos $x$ as follows:

$$
p(y \mid x) = \prod_{\ell=1}^{L} p\left(y_{\ell} \mid y_{<\ell}, x_{\leq \ell}\right) \tag{1}
$$

where

$y_{\ell}$ is the $\ell$-th language token of the input text,
$y_{<\ell}$ is the set of all tokens preceding the $\ell$-th token,
$x_{\leq \ell}$ is the set of images/videos preceding token $y_{\ell}$ in the interleaved sequence,
$p$ is the distribution parametrized by a Flamingo model.

This ability to handle interleaved text and visual sequences (Section 2.3) makes it natural to use Flamingo for in-context few-shot learning, analogously to few-shot prompting with GPT-3. The model is trained on a mixture of diverse datasets, described in Section 2.4.

Figure 4: GATED XATTN-DENSE layers.
To condition the Language Model (LM) on visual inputs, we insert new cross-attention layers between the existing pretrained and frozen LM layers. In these cross-attention layers, the keys and values are obtained from the vision features, while the queries are derived from the language inputs.
The cross-attention is followed by a dense feed-forward layer.
These layers are gated so that the LM is kept intact at initialization, which improves stability and performance.

2.1 Visual processing and the Perceiver Resampler

Vision Encoder: from pixels to features
Our vision encoder is a pretrained Normalizer-Free ResNet (NFNet) [10]; we use the F6 model. It is pretrained on datasets of image-text pairs with a contrastive objective, using the two-term contrastive loss of Radford et al. [85].
The output of the encoder is a 2D spatial grid of features that is flattened to a 1D sequence. For video inputs, frames are sampled at 1 FPS and encoded independently, yielding a 3D spatio-temporal grid of features to which learned temporal embeddings are added. The features are then flattened to 1D before being fed to the Perceiver Resampler.
Details on the contrastive model training and its performance are given in Appendix B.1.3 and Appendix B.3.2, respectively.
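To make the video preprocessing above concrete, here is a minimal Python sketch that picks frames at 1 FPS, adds one temporal embedding per frame, and flattens the result into a single 1D sequence. The function names, the nested-list feature layout, and the rounding rule used to pick frames are assumptions made for illustration; they are not taken from the paper's implementation.

```python
def sample_frame_indices(num_frames, native_fps, target_fps=1.0):
    """Return the indices of the frames kept when sampling at target_fps."""
    step = native_fps / target_fps  # source frames between two kept frames
    indices, t = [], 0.0
    while round(t) < num_frames:
        indices.append(int(round(t)))
        t += step
    return indices


def flatten_video_features(frame_features, temporal_embeddings):
    """Add a per-frame temporal embedding, then flatten [T, S, d] to [T * S, d].

    frame_features: T frames, each a list of S feature vectors of size d.
    temporal_embeddings: T vectors of size d (one per frame).
    """
    flat = []
    for frame, t_emb in zip(frame_features, temporal_embeddings):
        for vec in frame:
            flat.append([v + e for v, e in zip(vec, t_emb)])
    return flat


# A 3-second clip at 30 FPS keeps one frame per second.
indices = sample_frame_indices(num_frames=90, native_fps=30.0)

# Toy features: T=2 frames, S=2 spatial positions, d=2 channels.
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
time_emb = [[0.5, 0.5], [1.0, 1.0]]
flat = flatten_video_features(features, time_emb)
```

The flattened sequence has `T * S` entries, which is why its length grows with both resolution and clip duration.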

Perceiver Resampler: from varying-size large feature maps to few visual tokens
As shown in Figure 3, this module connects the vision encoder to the frozen language model. The Perceiver Resampler takes as input a variable number of image or video features from the vision encoder and produces a fixed number of visual outputs (64), which reduces the computational complexity of the vision-text cross-attention.
Similar to Perceiver [48] and DETR [13], we learn a predefined number of latent input queries, which are fed to a Transformer and cross-attend to the visual features.
Our ablation studies (Section 3.3) show that this vision-language resampler module outperforms a plain Transformer and an MLP. An illustration, further architectural details, and pseudo-code are given in Appendix A.1.1.

2.2 Conditioning frozen language models on visual representations

Text generation is performed by a Transformer decoder conditioned on the visual representations produced by the Perceiver Resampler. We interleave pretrained and frozen text-only LM blocks with blocks trained from scratch that cross-attend to the visual output of the Perceiver Resampler.

Interleaving new GATED XATTN-DENSE layers within a frozen pretrained LM
We freeze the pretrained LM blocks and insert gated cross-attention dense blocks (Figure 4), trained from scratch, between the original layers. To ensure that at initialization the conditioned model yields the same results as the original language model, we use a tanh-gating mechanism [41]. The output of a newly added layer is multiplied by $\tanh(\alpha)$ before being added to the input representation from the residual connection, where $\alpha$ is a layer-specific learnable scalar initialized to 0 [4]. At initialization, the model output therefore matches that of the pretrained LM, which improves training stability and final performance.
In our ablation studies (Section 3.3), we compare the proposed GATED XATTN-DENSE layers against recent alternatives [22, 68] and explore the trade-off between efficiency and expressivity by varying how frequently these layers are inserted. See Appendix A.1.2 for details.
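The effect of the tanh gate can be checked with a small Python sketch. It assumes one scalar gate for the cross-attention sub-layer and one for the dense sub-layer, and it uses toy stand-ins (`toy_cross_attention`, `toy_ffw`) operating on plain lists; these names and the two-gate layout are illustrative assumptions rather than the authors' code.

```python
import math


def gated_xattn_dense(y, x, cross_attention, ffw, alpha_xattn, alpha_dense):
    """Apply one gated cross-attention + dense block to language features y.

    y: language features (toy: a list of floats).
    x: visual tokens handed to the cross-attention sub-layer.
    alpha_xattn, alpha_dense: learnable scalar gates, initialised to 0.
    """
    # Gated cross-attention: residual input + tanh(alpha) * sub-layer output.
    attn_out = cross_attention(y, x)
    y = [yi + math.tanh(alpha_xattn) * ai for yi, ai in zip(y, attn_out)]
    # Gated dense feed-forward, with its own residual connection.
    ffw_out = ffw(y)
    y = [yi + math.tanh(alpha_dense) * fi for yi, fi in zip(y, ffw_out)]
    return y


def toy_cross_attention(y, x):
    # Hypothetical stand-in: every position receives the mean visual token.
    mean_x = sum(x) / len(x)
    return [mean_x for _ in y]


def toy_ffw(y):
    # Hypothetical stand-in for the dense feed-forward sub-layer.
    return [2.0 * yi for yi in y]


language = [0.5, -1.0, 2.0]
visual = [1.0, 3.0]
at_init = gated_xattn_dense(language, visual, toy_cross_attention, toy_ffw, 0.0, 0.0)
trained = gated_xattn_dense(language, visual, toy_cross_attention, toy_ffw, 0.5, 0.5)
```

With both gates at 0, `tanh(0) = 0` and the block is an identity, so the frozen LM's activations pass through unchanged; once the gates move away from 0, visual information starts to flow in.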

Varying model sizes. We perform experiments across three model sizes, building on the 1.4B, 7B, and 70B parameter Chinchilla models [42], and call them Flamingo-3B, Flamingo-9B, and Flamingo-80B respectively. Throughout the paper, we refer to the largest model, Flamingo-80B, simply as Flamingo.
While the parameter count of the frozen LM and of the trainable vision-text GATED XATTN-DENSE modules grows with model size, the frozen vision encoder and the trainable Perceiver Resampler keep the same fixed size across all models, which is small relative to the full model size.
See Appendix B.1.1 for the detailed configurations.

2.3 Multi-visual input support: per-image/video attention masking

The image-causal modelling introduced in Equation (1) is obtained by masking the full text-to-image cross-attention matrix, limiting which visual tokens the model sees at each text token. At a given text token, the model attends only to the visual tokens of the image that appeared immediately before it in the interleaved sequence, rather than to all previous images (formalized and illustrated in Appendix A.1.3).
Although the model directly attends to only a single image at a time, the dependency on all previous images remains through self-attention in the LM.
This single-image cross-attention scheme has an important benefit: the model generalizes seamlessly to any number of visual inputs, regardless of how many were used during training. In particular, we use at most 5 images per sequence when training on our interleaved datasets, yet at evaluation time the model still benefits from sequences of up to 32 image/video-text pairs.
We show in Section 3.3 that this scheme is more effective than allowing the model to cross-attend to all previous images directly.
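The per-image masking can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes that the `<image>` tag position itself already sees the image it introduces and that each image contributes the same number of visual tokens; both are assumptions of this example rather than statements from the paper.

```python
def last_image_index(tokens, image_tag="<image>"):
    """For each text position, the 1-based index of the most recent image (0 if none)."""
    phi, count = [], 0
    for tok in tokens:
        if tok == image_tag:
            count += 1
        phi.append(count)
    return phi


def cross_attention_mask(tokens, tokens_per_image, image_tag="<image>"):
    """mask[t][v] is True when text position t may attend to visual token v."""
    phi = last_image_index(tokens, image_tag)
    num_images = phi[-1] if phi else 0
    mask = []
    for p in phi:
        row = []
        for img in range(1, num_images + 1):
            # Only the visual tokens of the most recent image are visible.
            row.extend([img == p] * tokens_per_image)
        mask.append(row)
    return mask


tokens = ["<BOS>", "<image>", "A", "cat", "<EOC>", "<image>", "A", "dog", "<EOC>"]
phi = last_image_index(tokens)
mask = cross_attention_mask(tokens, tokens_per_image=2)
```

Because each row of the mask exposes at most one image, the mask has the same structure whether the sequence holds 5 images or 32, which is consistent with the generalization behaviour described above.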

2.4 Training on a mixture of vision and language datasets

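We train the Flamingo models on a mixture of three kinds of datasets, all scraped from the web:

an interleaved image and text dataset derived from webpages,
image-text pairs,
video-text pairs.

M3W: Interleaved image and text dataset
The few-shot capabilities of Flamingo models rely on training on interleaved text and image data. For this purpose, we collected the MultiModal MassiveWeb (M3W) dataset. We extract both text and images from the HTML of approximately 43 million webpages, determining the positions of images relative to the text from the relative positions of the text and image elements in the Document Object Model (DOM).
An example is then constructed as follows:

<image> tags are inserted in the plain text at the locations of the images on the page,
a special learnable <EOC> (end of chunk) token is inserted before each image and at the end of the document.

From each document, we sample a random subsequence of 256 tokens and take up to the first 5 images included in the sampled sequence. Further images are discarded to save compute. More details are provided in Appendix A.3.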
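The M3W sample construction described above can be sketched as follows. This is a simplified illustration under stated assumptions: the chunk representation, the function names, and the choice to drop only the `<image>` tag of discarded images are hypothetical and not taken from the paper.

```python
import random


def build_m3w_sequence(chunks):
    """Turn page content into a token list with <image> and <EOC> markers.

    chunks: list of ("text", [tokens]) or ("image", image_id) in page order.
    """
    tokens, image_ids = [], []
    for kind, payload in chunks:
        if kind == "image":
            tokens += ["<EOC>", "<image>"]  # <EOC> precedes every image
            image_ids.append(payload)
        else:
            tokens += list(payload)
    tokens.append("<EOC>")  # and one <EOC> closes the document
    return tokens, image_ids


def sample_subsequence(tokens, image_ids, length=256, max_images=5, rng=random):
    """Sample a random window of `length` tokens and keep its first `max_images` images."""
    start = rng.randint(0, max(0, len(tokens) - length))
    window = tokens[start:start + length]
    first_image = tokens[:start].count("<image>")  # images before the window
    kept_tokens, kept_images, seen = [], [], 0
    for tok in window:
        if tok == "<image>":
            seen += 1
            if seen > max_images:
                continue  # discard images beyond the first `max_images`
            kept_images.append(image_ids[first_image + seen - 1])
        kept_tokens.append(tok)
    return kept_tokens, kept_images


# Toy page: 8 images, each followed by 10 words.
page = []
for i in range(8):
    page.append(("image", f"img{i}"))
    page.append(("text", ["word"] * 10))

tokens, image_ids = build_m3w_sequence(page)
kept_tokens, kept_images = sample_subsequence(tokens, image_ids, rng=random.Random(0))
```

In this toy page the document is shorter than 256 tokens, so the window covers it entirely and only the image cap takes effect.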

Pairs of image/video and text
For our image-text pairs we first leverage the ALIGN [50] dataset, composed of about 1.8 billion images paired with alt-text. To complement it, we collect our own dataset of image-text pairs targeting better quality and longer descriptions: LTIP (Long Text & Image Pairs), which consists of 312 million pairs.
We also collect a similar dataset with videos instead of still images: VTP (Video & Text Pairs), which consists of 27 million short videos (about 22 seconds on average) paired with sentence descriptions.
To align the syntax of these paired datasets with M3W, we prepend <image> and append <EOC> to each caption (see Appendix A.3.3 for details).

Multi-objective training and optimisation strategy
We train our models by minimizing a weighted sum of per-dataset expected negative log-likelihoods of text, given the visual inputs:

$$
\sum_{m=1}^{M} \lambda_{m} \cdot \mathbb{E}_{(x, y) \sim \mathcal{D}_{m}}\left[-\sum_{\ell=1}^{L} \log p\left(y_{\ell} \mid y_{<\ell}, x_{\leq \ell}\right)\right] \tag{2}
$$

where $\mathcal{D}_m$ is the $m$-th dataset and $\lambda_m$ is its weight.
Tuning the per-dataset weights $\lambda_m$ is key to performance. We accumulate gradients over all datasets, which we found to outperform the round-robin approach of [17]. Further training details and ablations are provided in Appendix B.1.2.
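The difference between accumulating gradients over all datasets and updating round-robin can be illustrated with a toy one-parameter model. Everything here is a hypothetical stand-in: the quadratic loss, the dataset contents, the weights, and the learning rate are chosen only to keep the arithmetic easy to follow and are not values from the paper.

```python
def accumulated_step(theta, datasets, weights, grad_fn, lr):
    """One update from the weighted sum of per-dataset gradients."""
    total_grad = 0.0
    for data, lam in zip(datasets, weights):
        total_grad += lam * grad_fn(theta, data)
    return theta - lr * total_grad


def round_robin_steps(theta, datasets, weights, grad_fn, lr):
    """Alternative: one separate update per dataset, applied in turn."""
    for data, lam in zip(datasets, weights):
        theta = theta - lr * lam * grad_fn(theta, data)
    return theta


def toy_grad(theta, batch):
    # Gradient of the mean of 0.5 * (theta - y)^2 over the batch.
    return sum(theta - y for y in batch) / len(batch)


# Three toy "datasets" with hypothetical weights lambda_m.
datasets = [[1.0, 1.0], [3.0], [5.0, 7.0]]
weights = [1.0, 0.2, 0.03]

theta_acc = accumulated_step(0.0, datasets, weights, toy_grad, lr=0.1)
theta_rr = round_robin_steps(0.0, datasets, weights, toy_grad, lr=0.1)
```

In the accumulated variant every dataset's gradient is evaluated at the same parameters before a single update, which matches the weighted-sum objective directly; in the round-robin variant each dataset sees parameters already moved by the previous one.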

2.5 Task adaptation with few-shot in-context learning

Once Flamingo is trained, we use it to tackle a visual task by conditioning it on a multimodal interleaved prompt. Analogously to GPT-3 [11], we evaluate how quickly Flamingo models adapt to new tasks using in-context learning. To do so, we interleave support example pairs in the form of (image, text) or (video, text), followed by the query visual input, to build a prompt (details in Appendix A.2).

Open-ended evaluations are performed using beam search for decoding,
close-ended evaluations are performed using the model's log-likelihood to score each possible answer.

We also explore zero-shot generalization by prompting the model with two text-only examples from the task, with no corresponding visual input.
Evaluation hyperparameters and additional details are given in Appendix B.1.5.
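The prompting procedure above can be sketched as follows. The sketch assumes the same `<image>` ... `<EOC>` syntax that the text describes for the paired training data; the dictionary layout, the function names, and `toy_log_likelihood` (a stand-in for a real model's scoring function) are hypothetical.

```python
def build_few_shot_prompt(support, query_visual, query_text=""):
    """Interleave (visual, text) support pairs, then append the query visual input."""
    visuals = [visual for visual, _ in support] + [query_visual]
    text = "".join("<image>" + target + "<EOC>" for _, target in support)
    text += "<image>" + query_text
    return {"visuals": visuals, "text": text}


def pick_closed_ended_answer(prompt, candidates, log_likelihood):
    """Close-ended evaluation: keep the candidate the model scores highest."""
    return max(candidates, key=lambda answer: log_likelihood(prompt, answer))


def toy_log_likelihood(prompt, answer):
    # Hypothetical scorer: favours answers that already occur in the prompt text.
    return 0.0 if answer in prompt["text"] else -1.0


prompt = build_few_shot_prompt(
    support=[("img_a", "A cat."), ("img_b", "A dog.")],
    query_visual="img_q",
)
choice = pick_closed_ended_answer(prompt, ["A bird.", "A dog."], toy_log_likelihood)
```

For open-ended tasks the same prompt would instead be passed to a decoding routine such as beam search, which is not shown here.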

| Method | FT | Shot | OKVQA (I) | VQAv2 (I) | COCO (I) | MSVDQA (V) | VATEX (V) | VizWiz (I) | Flickr30K (I) | MSRVTTQA (V) | iVQA (V) | YouCook2 (V) | STAR (V) | VisDial (I) | TextVQA (I) | NextQA (I) | HatefulMemes (I) | RareAct (V) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero/Few-shot SOTA | ✗ | (X) | 43.3 [34] (16) | 38.2 [114] (4) | 32.2 [124] (0) | 35.2 [58] (0) | - | - | - | 19.2 [58] (0) | 12.2 [135] (0) | - | 39.4 [143] (0) | 11.6 [79] (0) | - | - | 66.1 [85] (0) | 40.7 [85] (0) |
| Flamingo-3B | ✗ | 0 | 41.2 | 49.2 | 73.0 | 27.5 | 40.1 | 28.9 | 60.6 | 11.0 | 32.7 | 55.8 | 39.6 | 46.1 | 30.1 | 21.3 | 53.7 | 58.4 |
| | ✗ | 4 | 43.3 | 53.2 | 85.0 | 33.0 | 50.0 | 34.0 | 72.0 | 14.9 | 35.7 | 64.6 | 41.3 | 47.3 | 32.7 | 22.4 | 53.6 | - |
| | ✗ | 32 | 45.9 | 57.1 | 99.0 | 42.6 | 59.2 | 45.5 | 71.2 | 25.6 | 37.7 | 76.7 | 41.6 | 47.3 | 30.6 | 26.1 | 56.3 | - |
| Flamingo-9B | ✗ | 0 | 44.7 | 51.8 | 79.4 | 30.2 | 39.5 | 28.8 | 61.5 | 13.7 | 35.2 | 55.0 | 41.8 | 48.0 | 31.8 | 23.0 | 57.0 | 57.9 |
| | ✗ | 4 | 49.3 | 56.3 | 93.1 | 36.2 | 51.7 | 34.9 | 72.6 | 18.2 | 37.7 | 70.8 | 42.8 | 50.4 | 33.6 | 24.7 | 62.7 | - |
| | ✗ | 32 | 51.0 | 60.4 | 106.3 | 47.2 | 57.4 | 44.0 | 72.8 | 29.4 | 40.7 | 77.3 | 41.2 | 50.4 | 32.6 | 28.4 | 63.5 | - |
| Flamingo | ✗ | 0 | 50.6 | 56.3 | 84.3 | 35.6 | 46.7 | 31.6 | 67.2 | 17.4 | 40.7 | 60.1 | 39.7 | 52.0 | 35.0 | 26.7 | 46.4 | 60.8 |
| | ✗ | 4 | 57.4 | 63.1 | 103.2 | 41.7 | 56.0 | 39.6 | 75.1 | 23.9 | 44.1 | 74.5 | 42.4 | 55.6 | 36.5 | 30.8 | 68.6 | - |
| | ✗ | 32 | 57.8 | 67.6 | 113.8 | 52.3 | 65.1 | 49.8 | 75.4 | 31.0 | 45.3 | 86.8 | 42.2 | 55.6 | 37.9 | 33.5 | 70.0 | - |
| Pretrained FT SOTA | ✓ | (X) | 54.4 [34] (10K) | 80.2 [140] (444K) | 143.3 [124] (500K) | 47.9 [28] (27K) | 76.3 [153] (500K) | 57.2 [65] (20K) | 67.4 [150] (30K) | 46.8 [51] (130K) | 35.4 [135] (6K) | 138.7 [132] (10K) | 36.7 [128] (46K) | 75.2 [79] (123K) | 54.7 [137] (20K) | 25.2 [129] (38K) | 79.1 [62] (9K) | - |

In the two SOTA rows, each cell gives the score, the reference, and in parentheses the number of shots (X) or fine-tuning examples used.

Table 1: Comparison to the state of the art
A single Flamingo model reaches the state of the art on a wide array of image (I) and video (V) understanding tasks with few-shot learning alone, significantly outperforming previous zero-shot and few-shot methods with as few as four examples. More importantly, using only 32 examples and without updating any model weights, Flamingo outperforms the current best methods, which are fine-tuned on thousands of annotated examples, on seven tasks.
Best few-shot numbers are in bold, and best numbers overall are underlined.

3 Experiments

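Our goal is to develop models that can rapidly adapt to diverse and challenging tasks. To this end, we consider a wide array of 16 popular multimodal image/video and language benchmarks.
To validate model design decisions during the course of the project, 5 of these benchmarks were used as our development (DEV) set: COCO, OKVQA, VQAv2, MSVDQA, and VATEX.
Performance estimates on the DEV benchmarks may be biased as a result of model selection. We note that this is also the case for prior work that uses similar benchmarks to validate and ablate design decisions.
To account for this, we also report performance on an additional set of 11 benchmarks, spanning captioning, video question-answering, and less commonly explored capabilities such as visual dialogue and multi-choice question-answering. The evaluation benchmarks are described in Appendix B.1.4. We keep all evaluation hyperparameters fixed across all benchmarks and, depending on the task, use one of four few-shot prompt templates (see Appendix B.1.5 for details).
We emphasize that we do not validate any design decision on these 11 benchmarks and use them solely to estimate the unbiased few-shot learning performance of our models.

Concretely, when evaluating the few-shot learning performance of a model, we prompt it with a set of support samples and measure performance on a set of query samples.
For the DEV benchmarks, which are used both to validate design decisions and hyperparameters and to report final performance, we use four subsets:

validation support, validation query, test support, test query

For the other benchmarks, only test support and test query are needed.
How these subsets are formed is described in Appendix B.1.4.

We report the few-shot learning performance of Flamingo models in Section 3.1, fine-tuning results in Section 3.2, and an ablation study in Section 3.3.
Appendix B.2 provides additional results, including Flamingo's performance on the ImageNet and Kinetics700 classification tasks and the performance of our contrastive model.
Appendix C includes additional qualitative results.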

3.1 Few-shot learning on vision-language tasks

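Few-shot results
Results are given in Table 1.
Flamingo outperforms all previous zero-shot and few-shot methods by a large margin across the 16 benchmarks considered.
This is achieved with as few as four examples per task, demonstrating practical and efficient adaptation of vision models to new tasks.
More importantly, Flamingo is often competitive with state-of-the-art methods that are additionally fine-tuned on up to hundreds of thousands of annotated examples.
On six tasks, Flamingo even outperforms the fine-tuned SotA while using a single set of frozen model weights and only 32 task-specific examples.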

| Method | VQAv2 test-dev | VQAv2 test-std | COCO test | VATEX test | VizWiz test-dev | VizWiz test-std | MSRVTTQA test | VisDial valid | VisDial test-std | YouCook2 valid | TextVQA valid | TextVQA test-std | HatefulMemes test seen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flamingo (32 shots) | 67.6 | - | 113.8 | 65.1 | 49.8 | - | 31.0 | 56.8 | - | 86.8 | 36.0 | - | 70.0 |
| Flamingo (fine-tuned) | 82.0 | $\underline{82.1}$ | 138.1 | 84.2 | 65.7 | 65.4 | 47.4 | 61.8 | 59.7 | 118.6 | 57.1 | 54.1 | $\underline{86.6}$ |
| SotA | 81.3† [133] | 81.3† [133] | 149.6† [119] | 81.4† [153] | 57.2† [65] | 60.6† [65] | 46.8 [51] | 75.2 [79] | **75.4**† [123] | 138.7 [132] | 54.7 [137] | 73.7 [84] | 84.6† [152] |

Table 2: Comparison to SotA when fine-tuning Flamingo
We fine-tune Flamingo on the nine tasks where it does not reach SotA with few-shot learning alone. Flamingo sets a new SotA on five of them, outperforming methods (marked with †) that use tricks such as model ensembling or domain-specific metric optimisation (e.g., CIDEr optimisation).

| | Ablated setting | Flamingo-3B original value | Changed value | Param. count ↓ | Step time ↓ | COCO CIDEr ↑ | OKVQA top1 ↑ | VQAv2 top1 ↑ | MSVDQA top1 ↑ | VATEX CIDEr ↑ | Overall score ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Flamingo-3B model | | | | 3.2B | 1.74s | 86.5 | 42.1 | 55.8 | 36.3 | 53.4 | 70.7 |
| (i) | Training data | All data | w/o Video-Text pairs | 3.2B | 1.42s | 84.2 | 43.0 | 53.9 | 34.5 | 46.0 | 67.3 |
| | | | w/o Image-Text pairs | 3.2B | 0.95s | 66.3 | 39.2 | 51.6 | 32.0 | 41.6 | 60.9 |
| | | | Image-Text pairs → LAION | 3.2B | 1.74s | 79.5 | 41.4 | 53.5 | 33.9 | 47.6 | 66.4 |
| | | | w/o M3W | 3.2B | 1.02s | 54.1 | 36.5 | 52.7 | 31.4 | 23.5 | 53.4 |
| (ii) | Optimisation | Accumulation | Round Robin | 3.2B | 1.68s | 76.1 | 39.8 | 52.1 | 33.2 | 40.8 | 62.9 |
| (iii) | Tanh gating | ✓ | ✗ | 3.2B | 1.74s | 78.4 | 40.5 | 52.9 | 35.9 | 47.5 | 66.5 |
| (iv) | Cross-attention architecture | GATED XATTN-DENSE | VANILLA XATTN | 2.4B | 1.16s | 80.6 | 41.5 | 53.4 | 32.9 | 50.7 | 66.9 |
| | | | GRAFTING | 3.3B | 1.74s | 79.2 | 36.1 | 50.8 | 32.2 | 47.8 | 63.1 |
| (v) | Cross-attention frequency | Every | Single in middle | 2.0B | 0.87s | 71.5 | 38.1 | 50.2 | 29.1 | 42.3 | 59.8 |
| | | | Every 4th | 2.3B | 1.02s | 82.3 | 42.7 | 55.1 | 34.6 | 50.8 | 68.8 |
| | | | Every 2nd | 2.6B | 1.24s | 83.7 | 41.0 | 55.8 | 34.5 | 49.7 | 68.2 |
| (vi) | Resampler | Perceiver | MLP | 3.2B | 1.85s | 78.6 | 42.2 | 54.7 | 35.2 | 44.7 | 66.6 |
| | | | Transformer | 3.2B | 1.81s | 83.2 | 41.7 | 55.6 | 31.5 | 48.3 | 66.7 |
| (vii) | Vision encoder | NFNet-F6 | CLIP ViT-L/14 | 3.1B | 1.58s | 76.5 | 41.6 | 53.4 | 33.2 | 44.5 | 64.9 |
| | | | NFNet-F0 | 2.9B | 1.45s | 73.8 | 40.5 | 52.8 | 31.1 | 42.9 | 62.7 |
| (viii) | Freezing LM | ✓ | ✗ (random init) | 3.2B | 2.42s | 74.8 | 31.5 | 45.6 | 26.9 | 50.1 | 57.8 |
| | | | ✗ (pretrained) | 3.2B | 2.42s | 81.2 | 33.7 | 47.4 | 31.0 | 53.9 | 62.7 |

Table 3: Ablation studies
Each row should be compared to the baseline Flamingo run (top row). Step time measures the time spent performing gradient updates on all training datasets.

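Finally, although we used only the DEV benchmarks for design decisions, our results generalize well to the other benchmarks, confirming the generality of our approach.

Scaling with respect to parameters and shots
As shown in Figure 2, the larger the model, the better the few-shot performance, similar to GPT-3 [11]. Performance also improves with the number of shots.
We further find that the largest model makes better use of larger numbers of shots. Interestingly, even though our Flamingo models were trained on M3W with sequences limited to at most 5 images, they still benefit from up to 32 images or videos at inference time. This demonstrates the flexibility of the Flamingo architecture in processing a variable number of images or videos.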

3.2 Fine-tuning Flamingo as a pretrained vision-language model

While not the main focus of our work, we verify that Flamingo models can be adapted to a task by fine-tuning their weights when given more data. In Table 2, we explore fine-tuning our largest model, Flamingo, on a given task with no limit on the annotation budget.
Concretely, we fine-tune the model with a short schedule and a small learning rate, additionally unfreezing the vision backbone to accommodate a higher input resolution (details in Appendix B.2.2).
The results improve over our previously presented in-context few-shot learning results, and set a new state of the art on five additional tasks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes.

3.3 Ablation studies

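In Table 3, we report ablation results using Flamingo-3B on the validation subsets of the five DEV benchmarks with 4 shots. These experiments use smaller batch sizes and a shorter training schedule than the final models. The Overall score is obtained by dividing each benchmark score by its SotA performance from Table 1 and averaging the results. More details and results are given in Appendix B.3 and Table 10.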
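As a worked example of the Overall score, the following Python sketch normalises each DEV benchmark score by a SotA value and averages the ratios. It assumes that the denominators are the fine-tuned SotA numbers listed in Table 1 (an assumption of this sketch); since the inputs are rounded table entries, the result is only expected to match Table 3 to within about 0.1.

```python
# Assumed denominators: fine-tuned SotA values from Table 1 for the five DEV benchmarks.
SOTA = {"COCO": 143.3, "OKVQA": 54.4, "VQAv2": 80.2, "MSVDQA": 47.9, "VATEX": 76.3}


def overall_score(scores, sota=SOTA):
    """Average of per-benchmark scores divided by SotA, expressed in percent."""
    ratios = [scores[name] / sota[name] for name in sota]
    return 100.0 * sum(ratios) / len(ratios)


# Rows copied from Table 3.
baseline = {"COCO": 86.5, "OKVQA": 42.1, "VQAv2": 55.8, "MSVDQA": 36.3, "VATEX": 53.4}
without_m3w = {"COCO": 54.1, "OKVQA": 36.5, "VQAv2": 52.7, "MSVDQA": 31.4, "VATEX": 23.5}

baseline_overall = overall_score(baseline)        # Table 3 reports 70.7
without_m3w_overall = overall_score(without_m3w)  # Table 3 reports 53.4
drop = baseline_overall - without_m3w_overall
```

The difference between the two scores is the roughly 17-point drop attributed to removing M3W in the ablation discussion.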

Importance of the training data mixture
As shown in row (i), getting the right training data plays a crucial role. Removing the interleaved image-text dataset M3W leads to a drop of more than 17% in performance, while removing the conventional paired image-text datasets also decreases performance by 9.8%, showing that both types of datasets are needed. Moreover, removing the video-text dataset degrades performance on all video tasks.
We also ablate replacing our own image-text pairs (ITP) with the publicly available LAION-400M dataset [96], which leads to a slight degradation in performance.
Row (ii) shows that our gradient accumulation strategy is more effective than round-robin updates [17].

Visual conditioning of the frozen LM
Row (iii) examines the effect of the 0-initialized tanh gating used when merging the cross-attention output into the frozen LM. Removing it decreases the overall score by 4.2%, and we also observed that it worsens training stability.
Row (iv) compares different conditioning architectures. VANILLA XATTN refers to the vanilla cross-attention from the original Transformer decoder [115]. In GRAFTING [68], the LM itself is used as is, and a new stack of interleaved self-attention and cross-attention layers is trained on top of its output.
GATED XATTN-DENSE performs best among these options.

Compute/Memory vs. performance trade-offs
Row (v) analyses the trade-off associated with how frequently GATED XATTN-DENSE blocks are inserted. Adding them at every layer gives good performance but increases training time, whereas adding one every fourth layer accelerates training by 66% while decreasing the overall score by only 1.9%.
In light of this trade-off, we add a GATED XATTN-DENSE every fourth layer for Flamingo-9B and every seventh layer for Flamingo-80B.
Row (vi) compares the Perceiver Resampler to an MLP and to a vanilla Transformer. Both alternatives are worse in performance and slower.

Vision encoder
Row (vii) compares our contrastively trained NFNet-F6 vision encoder to the publicly available CLIP ViT-L/14 [85] (at resolution 224) and to a smaller NFNet-F0.
NFNet-F6 improves over CLIP ViT-L/14 by +5.8% and over NFNet-F0 by +8.0%, highlighting the importance of a strong vision backbone.

Freezing LM components prevents catastrophic forgetting
Row (viii) verifies the importance of freezing the LM layers during training.
Training the LM from scratch causes a large drop of 12.9%, and even fine-tuning the pretrained LM leads to a drop of 8.0%.
This indicates an instance of "catastrophic forgetting" [71], in which the model progressively forgets its pretraining while training on a new objective. In our setting, freezing the language model is a better alternative than training with the pretraining dataset (MassiveText) in the mixture.

4 Related work

Language modelling and few-shot adaptation
Language modelling has made substantial progress since the introduction of Transformers [115]. The paradigm of first pretraining on a vast amount of data and then adapting to a downstream task has become standard [11, 23, 32, 44, 52, 75, 87, 108]. In this work, we build on the 70B Chinchilla language model [42] as the base LM for Flamingo. Several prior works have explored techniques to adapt language models to novel tasks using a few examples. These include adding small adapter modules [43], fine-tuning a small part of the LM [141], showing in-context examples in the prompt [11], or optimizing the prompt itself through gradient descent [56, 60]. In this paper, rather than more involved few-shot learning approaches such as those based on metric learning [24, 103, 112, 117] or meta-learning [6, 7, 27, 31, 91, 155], we take inspiration from the in-context few-shot learning technique introduced with GPT-3 [11] and apply it to Flamingo.

When language meets vision
These LM breakthroughs have been influential for vision-language modelling. In particular, BERT [23] inspired a large body of vision-language work [16, 28, 29, 38, 59, 61, 66, 101, 106, 107, 109, 118, 121, 142, 143, 151]. Flamingo differs from these in that it does not require fine-tuning on new tasks. Another family of vision-language models is based on contrastive learning [2, 5, 49, 50, 57, 74, 82, 85, 138, 140, 146]. Flamingo differs from contrastive models in that it can generate text, although we build on them for our vision encoder. Similar to our work, some VLMs are designed to generate text in an autoregressive manner [19, 25, 45, 67, 116]. Concurrent works have also proposed to formulate numerous vision tasks as text generation problems [17, 58, 119, 124, 154]. Building vision-language models on top of pretrained large language models has also been explored in several works, some of which [26, 68, 78, 114, 136, 144] propose to freeze the language model weights to prevent catastrophic forgetting [71]. We follow this approach by freezing the Chinchilla LM layers and inserting learnable layers within them. In contrast to prior work, however, we propose the first LM that can ingest arbitrarily interleaved images, videos, and text.

Web-scale vision and language training datasets
Manually annotated vision-language datasets are costly to obtain and are therefore relatively small, typically on the order of 10k to 100k examples [3, 15, 69, 122, 129, 139]. To alleviate this lack of data, numerous works [14, 50, 98, 110] automatically collect image-text pairs that are readily available on the web. In addition to such paired data, we show the importance of also training on entire multimodal webpages containing interleaved images and text as a single sequence. The concurrent work CM3 [1] proposes to generate HTML markup from pages, whereas we simplify the text prediction task by generating only plain text. We also focus our evaluation on few-shot learning and vision tasks, while CM3 [1] primarily evaluates on language-only benchmarks in a zero-shot or fine-tuned setting.

5 Discussion

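Limitations
First, our models build on pretrained LMs and, as a side effect, directly inherit their weaknesses. For example, LM priors are generally helpful but may occasionally lead to hallucinations and ungrounded guesses. Furthermore, LMs generalize poorly to sequences longer than those seen during training, and they also suffer from poor sample efficiency during training. Addressing these issues could accelerate progress in the field and enhance the abilities of VLMs like Flamingo.

Second, the classification performance of Flamingo lags behind that of state-of-the-art contrastive models [82, 85]. These contrastive models directly optimize for text-image retrieval, of which classification is a special case. In contrast, our models handle a wider range of tasks, including open-ended ones. A unified approach that combines the strengths of both is an important research direction.

Third, in-context learning has significant advantages over gradient-based few-shot learning methods, but it also has drawbacks depending on the characteristics of the application at hand. In this paper we demonstrate the effectiveness of in-context learning when only a few dozen examples are available. In-context learning also enables simple deployment, requiring only inference and generally no hyperparameter tuning. However, in-context learning is known to be highly sensitive to various aspects of the demonstrations [80, 148], and its inference compute cost and absolute performance scale poorly once the number of shots grows beyond this low-data regime. There may therefore be opportunities to combine few-shot learning methods that complement each other. We discuss these limitations in more detail in Appendix D.1.

Societal impacts
Flamingo has a number of potential benefits, but it also carries certain risks. Its ability to rapidly adapt to a broad range of tasks with little data can enable non-expert users to obtain good performance in data-starved regimes. This property lowers the barrier for beneficial applications, but equally for malicious ones.
Flamingo is exposed to the same risks as large language models, such as generating offensive language, propagating social biases and stereotypes, and leaking private information [42, 126]. Moreover, its ability to handle visual inputs poses specific risks, such as gender and racial biases relating to the contents of the input images, similar to a number of visual recognition systems [12, 21, 37, 97, 147].
We refer the reader to Appendix D.2 for a more extensive discussion of the positive and negative societal impacts of our work, as well as mitigation strategies and early investigations of risks relating to gender and racial bias and toxic outputs. Finally, we note that, following prior work on language models [72, 81, 111], the few-shot capabilities of Flamingo could also be useful for mitigating such risks.

Conclusion
We proposed Flamingo, a general-purpose family of models that can be applied to image and video tasks with minimal task-specific training data. We also qualitatively explored interactive abilities of Flamingo such as "chatting" with the model, demonstrating flexibility beyond traditional vision benchmarks. Our results suggest that connecting pretrained large language models with powerful visual models is an important step towards general-purpose visual understanding.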

Acknowledgments and Disclosure of Funding
This research was funded by DeepMind. We would like to thank the following colleagues for useful discussions, suggestions, feedback, and advice: Samuel Albanie, Relja Arandjelović, Kareem Ayoub, Lorrayne Bennett, Adria Recasens Continente, Tom Eccles, Nando de Freitas, Sander Dieleman, Conor Durkan, Aleksa Gordić, Raia Hadsell, Will Hawkins, Lisa Anne Hendricks, Felix Hill, Jordan Hoffmann, Geoffrey Irving, Drew Jaegle, Koray Kavukcuoglu, Agustin Dal Lago, Mateusz Malinowski, Soňa Mokrá, Gaby Pearl, Toby Pohlen, Jack Rae, Laurent Sifre, Francis Song, Maria Tsimpoukelli, Gregory Wayne, and Boxi Wu.

Checklist

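1 For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 5.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 5 for a brief discussion and Appendix D.2 for the full discussion.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

2 If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]

3 If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The code and the data are proprietary.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 3 and Appendix B.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] The variance across runs was not large, and for the largest models, multiple training runs are not feasible within our compute budget.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Details can be found in Appendix B.1.2. In short, our largest run was trained on 1536 TPU chips for 15 days.

4 If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] We properly cited the prior methods on which our work is based, as well as prior datasets when appropriate (e.g., ALIGN).
(b) Did you mention the license of the assets? [N/A] The assets we used come from previous work for which we cited the papers. The licenses of the visual assets used in the figures of the paper are mentioned in Appendix G.
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] Our data was automatically scraped from millions of webpages. See the Datasheets [30] in Appendix F.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] See the Datasheets [30] in Appendix F.

5 If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]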

Appendix

We provide an overview of the Appendix below.
Method (Appendix A). We first provide additional details about our model in Appendix A.1:

An illustration and pseudo-code of the Perceiver Resampler (described in Section 2.1) is provided in Appendix A.1.1 and Figure 5.
A similar illustration is provided for the GATED XATTN-DENSE layer (described in Section 2.2) in Appendix A.1.2 and Figure 4.
Details on our implementation of the multi-image/video attention mechanism (Section 2.3) are given in Appendix A.1.3.
Hyperparameters for all model architectures are given in Appendix A.1.4.

We then explain how we evaluate our models using in-context few-shot learning in Appendix A.2. This includes how we build the few-shot prompt, how we obtain predictions for open-ended and close-ended tasks, how we obtain the zero-shot numbers, and how we leverage retrieval and ensembling to take advantage of more annotated examples.

Finally, in Appendix A.3, we provide more details on our training datasets:

Collection of M3W in Appendix A.3.1,
How we process M3W samples during training in Appendix A.3.2,
Collection of LTIP and VTP in Appendix A.3.3,
Deduplication strategy employed to ensure that there is no leakage between our training and evaluation datasets in Appendix A.3.4.

Experiments (Appendix B)
We first provide additional training and evaluation details in Appendix B.1:

Details on Flamingo-3B, Flamingo-9B, and Flamingo in Appendix B.1.1,
Training hyperparameters in Appendix B.1.2,
More details on the contrastive model pretraining in Appendix B.1.3,
Details on our evaluation benchmarks and splits in Appendix B.1.4,
A discussion of the few-shot learning hyperparameters in Appendix B.1.5,
The dialogue prompt used in the qualitative dialogue examples shown in Figure 1 and Figure 11 in Appendix B.1.6.

Next, we give additional results obtained by our models in Appendix B.2, including the performance of the Flamingo models on classification tasks (Appendix B.2.1), fine-tuning results (Appendix B.2.2), and zero-shot results from our contrastive models (Appendix B.2.3).

Finally, we provide more ablation studies in Appendix B.3, both for the Flamingo models (Appendix B.3.1) and for our pretrained contrastive Visual Encoder (Appendix B.3.2).

Qualitative results (Appendix C)
More qualitative results are given in Appendix C: Figure 10 (single-image samples), Figure 11 (dialogue examples), and Figure 12 (video examples).

Discussion (Appendix D)
We provide a more in-depth discussion of the limitations, failure cases, broader impacts, and societal impacts of our work in Appendix D.

Model card (Appendix E)
The Flamingo model card is provided in Appendix E.

Datasheets (Appendix F)
Datasheets for M3W, LTIP, and VTP are provided in Appendix F.1, Appendix F.2.1, and Appendix F.2.2, respectively.

Credit for visual content (Appendix G)
We provide attribution and copyright information for all visual illustrations used in the paper in Appendix G.

```python
import numpy as np


def perceiver_resampler(
    x_f,              # The [T, S, d] visual features (T=time, S=space).
    time_embeddings,  # The [T, 1, d] time position embeddings.
    x,                # R learned latents of shape [R, d].
    layers,           # One (attention, ffw) pair of callables per layer.
):
    """The Perceiver Resampler model."""
    # Add the time position embeddings (broadcast over space) and flatten.
    x_f = x_f + time_embeddings
    x_f = x_f.reshape(-1, x_f.shape[-1])  # [T, S, d] -> [T * S, d]
    # Apply the Perceiver Resampler layers.
    for attention, ffw in layers:
        # Attention: latents query the visual features concatenated with the latents.
        x = x + attention(q=x, kv=np.concatenate([x_f, x], axis=0))
        # Feed forward.
        x = x + ffw(x)
    return x  # [R, d]: one output token per learned latent.
```

Figure 5: The Perceiver Resampler module maps a variable-size grid of spatio-temporal visual features output by the Vision Encoder to a fixed number of output tokens (five in the figure), independently of the input image resolution or the number of input video frames.
This Transformer has a set of learned latent vectors as queries, while the keys and values are the concatenation of the spatio-temporal visual features with the learned latent vectors.

A Method

A. 1 Model details

A.1.1 Perceiver Resampler

Expanding on the brief description in Section 2.1, Figure 5 illustrates how the Perceiver Resampler processes an example video and also provides pseudo-code. Our Perceiver Resampler is similar in spirit to the Perceiver models proposed by Jaegle et al. [48]. We learn a predefined number of latent input queries and cross-attend them to the flattened visual features $X_f$.
The visual features $X_f$ are obtained as follows. We add a learned temporal position encoding to the features of each frame of the video (an image is treated as a single-frame video). Note that we only use temporal encodings and no explicit spatial grid position encodings, because we did not observe improvements from the latter.
The rationale is that CNNs, such as our NFNet encoder, are known to implicitly include spatial information channel-wise [47].
The visual features are then flattened and concatenated into a single sequence, as illustrated in Figure 5. The number of output tokens of the Perceiver Resampler equals the number of learned latent queries.
Unlike in DETR and Perceiver, the keys and values computed from the learned latents are concatenated to the keys and values obtained from $X_f$, which we found to perform slightly better.
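The data flow described in this section can be sketched in pure Python. This is only an illustration under simplifying assumptions that are not taken from the paper: a single attention head, identity query/key/value projections, and no feed-forward sublayer. The names `attend` and `resampler_sketch` are hypothetical. The sketch shows that the number of output tokens equals the number of latents, regardless of the number of frames T or spatial positions S.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Single-head dot-product attention with identity projections (assumption)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(d) for kv in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[c] for w, kv in zip(weights, keys_values)) for c in range(d)])
    return out

def resampler_sketch(x_f, time_embeddings, latents, num_layers):
    """x_f: [T][S][d] features, time_embeddings: [T][d], latents: [R][d]."""
    # Add each frame's temporal embedding to all of its spatial positions, then flatten.
    flat = [
        [v + t for v, t in zip(feat, time_embeddings[ti])]
        for ti, frame in enumerate(x_f)
        for feat in frame
    ]  # [T * S][d]
    x = [row[:] for row in latents]
    for _ in range(num_layers):
        # Keys/values: visual features concatenated with the latents themselves.
        update = attend(x, flat + x)
        x = [[a + b for a, b in zip(xi, ui)] for xi, ui in zip(x, update)]
    return x

T, S, d, R = 3, 4, 2, 5
x_f = [[[0.1 * (t + s), 0.2 * s] for s in range(S)] for t in range(T)]
time_embeddings = [[0.01 * t, -0.01 * t] for t in range(T)]
latents = [[0.5 * r, 1.0 - 0.1 * r] for r in range(R)]
out = resampler_sketch(x_f, time_embeddings, latents, num_layers=2)
print(len(out), len(out[0]))  # number of output tokens, feature size
```

Passing a single frame (`x_f[:1]`) yields the same number of output tokens, which is the property the fixed-size resampling relies on.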

A.1.2 GATED XATTN-DENSE details

Figure 4 shows the architecture of a GATED XATTN-DENSE block and how it connects to a frozen LM block, together with its pseudo-code.

Figure 6 plots the absolute value of the tanh gating values as a function of training progress (from 0% to 100%) at each of the 24 LM layers of the Flamingo-3B model. At every layer of the frozen LM, the absolute tanh gating values grow quickly from their initialization at 0, which suggests that each layer makes active use of the visual information.
We also observe that the absolute values tend to grow with layer depth, but no strong conclusion can be drawn from this, because the scale of the activations before gating may itself vary with depth.

Further work is needed to better understand the effect of these added layers on the optimization dynamics and on the model itself.
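To make the role of the gate concrete, here is a minimal sketch of a tanh-gated residual connection. It assumes one scalar gate parameter `alpha` per layer, initialised at 0, and uses a made-up vector in place of the real cross-attention output. The function name `gated_residual` is hypothetical.

```python
import math

def gated_residual(y, branch_output, alpha):
    """Return y + tanh(alpha) * branch_output, element-wise."""
    gate = math.tanh(alpha)
    return [yi + gate * bi for yi, bi in zip(y, branch_output)]

y = [1.0, -2.0, 0.5]               # activations entering the frozen LM block
xattn_out = [10.0, 10.0, 10.0]     # hypothetical cross-attention output
at_init = gated_residual(y, xattn_out, alpha=0.0)
later = gated_residual(y, xattn_out, alpha=0.5)
print(at_init)
print(later)
```

With `alpha = 0` the gated branch contributes nothing, so the frozen LM output is unchanged at initialization. As `|alpha|` grows, more of the visual branch is mixed in, which is the quantity tracked in Figure 6.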

Figure 6: Evolution of the absolute value of the tanh gating at different layers of Flamingo-3B.

Figure 7: Interleaved visual data and text support.
Given text interleaved with images/videos, e.g. coming from a webpage, we first insert <image> tags at the locations of the visual data in the text, along with special tokens such as <BOS> for "beginning of sequence" and <EOC> for "end of chunk".
Images are processed independently by the Vision Encoder and Perceiver Resampler to extract visual tokens. At a given text token, the model only cross-attends to the visual tokens of the last preceding image/video. $\phi$ indicates which image/video a text token can attend to, or 0 when no image/video precedes it.
In practice, this selective cross-attention is implemented through masking, illustrated in the figure with dark blue entries (unmasked / visible) and light blue entries (masked / not visible).

A.1.3 Multi-visual input support

Figure 7 illustrates the masking approach we use to limit the number of visual tokens that a given text token sees. We also formalize our notation for interleaved sequences of images/videos and text.

Interleaved sequences of visual data and text
We consider interleaved image/video and text examples. Each example consists of three parts:

a sequence of text $y$,
a sequence of images/videos $x$,
the sequence of positions of the images in the text.

Based on the visual data positions, we define a function $\phi: [1, L] \mapsto [0, N]$ that assigns to each text position $\ell$ the index of the last image/video appearing before that position, or $\phi(\ell) = 0$ if no visual data appears before it.

The function $\phi$ defines which visual inputs are usable for predicting token $\ell$ in Equation (1):

the set of preceding text tokens is $y_{<\ell} \triangleq (y_1, \ldots, y_{\ell-1})$,
the set of preceding images/videos is $x_{\leq \ell} \triangleq \{x_i \mid i \leq \phi(\ell)\}$.
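The function $\phi$ and the resulting cross-attention mask can be sketched as follows. This is an illustrative sketch: the whitespace-level tokenization is made up, and treating the <image> tag itself as already "seeing" its image is an assumption rather than something stated in the text. The helper names are hypothetical.

```python
def compute_phi(tokens, image_tag="<image>"):
    """Index (1-based) of the last image seen up to each position, or 0 if none."""
    phi, seen = [], 0
    for tok in tokens:
        if tok == image_tag:
            seen += 1  # assumption: the tag position already points at its own image
        phi.append(seen)
    return phi

def cross_attention_mask(phi, num_images):
    """mask[l][i] is True when position l may attend to image i + 1.

    Each position attends only to the single last preceding image.
    """
    return [[p == i for i in range(1, num_images + 1)] for p in phi]

tokens = ["<BOS>", "Pets", "!", "<EOC>",
          "<image>", "My", "puppy", "<EOC>",
          "<image>", "My", "cat", "<EOC>"]
phi = compute_phi(tokens)
mask = cross_attention_mask(phi, num_images=2)
print(phi)
```

Positions before the first image have an all-False row, so they receive no visual information at all.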

A.1.4 Transformer architecture

Table 4 lists, for each Transformer component of the Flamingo models, the number of layers $L$, the hidden dimension $D$, the number of heads $H$, and the activation function (Act.) used in the feed-forward (FFW) layers.
In every configuration, the dimension of keys and values is set to $D / H$: 96 for the Perceiver Resampler and 128 for GATED XATTN-DENSE and the frozen LM. The hidden dimension of each feed-forward MLP is $4D$.
Note that the frozen LM was pretrained with the GeLU activation [39], whereas the remaining trainable Transformer layers use the Squared ReLU activation [104], which we found to outperform GeLU.

	Perceiver Resampler				GATED XATTN-DENSE				Frozen LM
	L	D	H	Act.	L	D	H	Act.	L	D	H	Act.
Flamingo-3B	6	1536	16	Sq. ReLU	24	2048	16	Sq. ReLU	24	2048	16	GeLU
Flamingo-9B	6	1536	16	Sq. ReLU	10	4096	32	Sq. ReLU	40	4096	32	GeLU
Flamingo	6	1536	16	Sq. ReLU	12	8192	64	Sq. ReLU	80	8192	64	GeLU

Table 4: Hyperparameters for the Transformers of the Flamingo models.
The hidden size of each feed-forward MLP is $4D$. $\mathbf{L}$: number of layers, $\mathbf{D}$: Transformer hidden size, $\mathbf{H}$: number of heads, Act.: FFW activation, Sq. ReLU: Squared ReLU [104].
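The key/value dimensions quoted above follow directly from the $D$ and $H$ columns of Table 4. The short check below recomputes $D / H$ and $4D$ for each configuration; the dictionary keys are informal labels chosen for this example.

```python
# (D, H) pairs copied from Table 4.
configs = {
    "Perceiver Resampler (all sizes)": (1536, 16),
    "XATTN-DENSE / LM, Flamingo-3B": (2048, 16),
    "XATTN-DENSE / LM, Flamingo-9B": (4096, 32),
    "XATTN-DENSE / LM, Flamingo": (8192, 64),
}

kv_dims = {name: D // H for name, (D, H) in configs.items()}
ffw_hidden = {name: 4 * D for name, (D, H) in configs.items()}

for name in configs:
    print(f"{name}: key/value dim = {kv_dims[name]}, FFW hidden = {ffw_hidden[name]}")
```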

Figure 8: Few-shot interleaved prompt generation.
Given some task-specific few-shot examples (a.k.a. support examples) and a query for which Flamingo should make a prediction, we build the prompt by interleaving images with their corresponding texts.
We introduce some formatting to do this: we prepend "Output:" to the expected response for vision-to-text tasks, and we use the format "Question: {question} Answer: {answer}" for visual question-answering tasks.

A.2 In-context few-shot evaluation details

In-context learning with Flamingo models
We evaluate the ability of our models to rapidly adapt to new tasks using in-context learning, following an approach analogous to the one used in GPT-3 [11].
In detail, we are given a set of support examples in the form of (image, text) or (video, text), where the image or video is the input visual and the text is the expected response plus any additional task-specific information (e.g. a question), together with a single visual query for which the model must make a prediction.
Given this, we build a multimodal prompt by concatenating the support examples followed by the visual query, as illustrated in Figure 8. Unless specified otherwise, we choose the concatenation order at random.
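A minimal sketch of this prompt construction is given below. It represents each image only by its <image> tag and uses the "Output:" and "Question: ... Answer:" formats. The exact whitespace between tokens and the helper names `format_example` and `build_prompt` are assumptions made for illustration.

```python
import random

def format_example(answer=None, question=None):
    """One <image>-prefixed chunk; the answer is left open for the final query."""
    text = f"Question: {question} Answer:" if question is not None else "Output:"
    if answer is not None:
        text += f" {answer}<EOC>"
    return "<image>" + text

def build_prompt(support, query_question=None, seed=None):
    examples = list(support)
    random.Random(seed).shuffle(examples)  # random concatenation order
    chunks = [format_example(ex["answer"], ex.get("question")) for ex in examples]
    chunks.append(format_example(None, query_question))  # the query to complete
    return "<BOS>" + "".join(chunks)

support = [
    {"answer": "This is a cat wearing sunglasses."},
    {"answer": "Three elephants walking in the savanna."},
]
prompt = build_prompt(support, seed=0)
print(prompt)

vqa_prompt = build_prompt(
    [{"question": "What is this?", "answer": "a cat"}],
    query_question="What color is it?",
)
print(vqa_prompt)
```

The model would then be asked to continue the text after the final "Output:" or "Answer:".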

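Open-ended and close-ended evaluations
In the open-ended setting, the text the model generates after the query image is taken as its prediction for that image, and generation stops at the first <EOC> (end of chunk) token. Unless specified otherwise, we always use beam search with a beam size of 3.
In the close-ended setting, all possible outputs are independently appended to the query image, and each resulting sequence is scored using the log-likelihood estimated by the model. These scores are then used to rank the candidate outputs in decreasing order of confidence.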
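The close-ended procedure amounts to scoring each candidate continuation and sorting. The sketch below uses `toy_log_likelihood`, a hypothetical stand-in for the model's log-likelihood, so the numbers are meaningless. Only the ranking logic is illustrated.

```python
def rank_candidates(prompt, candidates, log_likelihood):
    """Score each candidate appended to the prompt; return them best-first."""
    scored = [(log_likelihood(prompt + " " + c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]

def toy_log_likelihood(sequence):
    # Hypothetical scorer: penalises length and favours sequences ending in "cat".
    return -0.1 * len(sequence) + (2.0 if sequence.endswith("cat") else 0.0)

ranked = rank_candidates("<BOS><image>Output:", ["dog", "cat", "elephant"], toy_log_likelihood)
print(ranked)
```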

Figure 9: Training datasets.
Mixture of training datasets of different formats. $N$ is the number of visual inputs (images or videos) in a single example; for paired image (or video) and text datasets, $N = 1$. $T$ is the number of video frames ($T = 1$ for images). $H$, $W$, and $C$ are the height, width, and number of color channels.

Zero-shot generalization
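In the absence of few-shot examples, approaches commonly rely on prompt engineering [85] to condition the model at inference time on a suitable natural language description of the task. Validating and selecting such prompts can have a large impact on performance, but it requires annotated examples and therefore cannot be considered truly zero-shot. Furthermore, Perez et al. [80] have shown that such validation procedures are generally not robust when only a few annotated examples are available.
To report zero-shot performance in our work, we instead build a prompt with two examples from the downstream task from which we remove the corresponding images or videos. For example, for the task illustrated at the top of Figure 8, the prompt would be "<BOS>Output: This is a cat wearing sunglasses.<EOC>Output: Three elephants walking in the savanna.<EOC><image> Output:", and no support images would be fed to the model.
We observed that showing only one text example is highly detrimental, as the model is biased towards producing text similar to that single example. Providing two text examples gave the best trade-off between practicality and performance, and using more than two helped only marginally, so we use exactly two text examples in all zero-shot results. In practice, we believe this is not more cumbersome than finding a good natural language description for a given task. This relates to recent findings on the aspects of demonstrations that drive performance [76].
For close-ended tasks, where the model scores candidate answers, there is no need to provide even a single text example in the zero-shot prompt.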

Retrieval-based In-Context Example Selection (RICES) [136]
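When the size of the support set exceeds a certain limit, it can become difficult to leverage all the examples with in-context learning, for two reasons: first, fitting all examples in the prompt becomes excessively expensive, and second, there is a risk of poor generalization when the prompt size exceeds the sequence length used during training [83].
In such situations, prompt selection can both limit the sequence length and potentially improve prompt quality, leading to better performance [63]. In particular, we follow the Retrieval-based In-Context Example Selection (RICES) approach [136].
In detail, given a query image, we retrieve similar images in the support set by comparing visual features extracted from our frozen pretrained visual encoder. We then build the prompt by concatenating the top-$N$ most similar examples.
Since LMs are sensitive to the most recent information in the prompt (recency bias) [148], we order the examples by increasing similarity, so that the most similar support example appears right before the query.
We notably show the effectiveness of this approach in classification settings with several hundred classes (see Appendix B.2.1), where we are given one or more images/videos per class, yielding a number of examples that would not otherwise fit in the prompt.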
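The retrieval and ordering steps can be sketched as below. Cosine similarity is an assumption here, since the text only says that visual features are compared. The feature vectors are made up and `rices_select` is a hypothetical name.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rices_select(query_feat, support, n):
    """support: list of (feature, text). Returns the n most similar examples,
    ordered from least to most similar so the best match is last."""
    ranked = sorted(support, key=lambda ex: cosine(query_feat, ex[0]), reverse=True)
    return list(reversed(ranked[:n]))

support = [
    ([1.0, 0.1], "cat"),
    ([0.0, 1.0], "car"),
    ([0.7, 0.7], "dog"),
    ([-1.0, 0.0], "tree"),
]
selected = rices_select([1.0, 0.0], support, n=2)
print([text for _, text in selected])
```

Placing the closest example last puts it immediately before the query in the prompt, which is where the recency bias mentioned above helps most.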

Prompt ensembling
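We also explore ensembling the model outputs across multiple prompts in the close-ended setting. This can notably be combined with RICES, where ensembling is done over multiple permutations of the ranked nearest neighbors.
Specifically, for a given answer candidate, we average the log-likelihoods estimated by the model over 6 random permutations of the selected few-shot examples.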
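A sketch of this averaging is shown below. `toy_score` is a hypothetical, order-dependent stand-in for the model log-likelihood, and `ensemble_scores` is an illustrative name. Only the permutation-and-average logic reflects the description above.

```python
import random

def ensemble_scores(examples, candidates, score_fn, num_permutations=6, seed=0):
    """Average score_fn over random orderings of the few-shot examples."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in candidates}
    for _ in range(num_permutations):
        order = examples[:]
        rng.shuffle(order)
        for c in candidates:
            totals[c] += score_fn(order, c)
    return {c: total / num_permutations for c, total in totals.items()}

def toy_score(ordered_examples, candidate):
    # Hypothetical scorer that mimics recency bias: it favours the last example's label.
    return -1.0 if ordered_examples[-1] == candidate else -2.0

avg = ensemble_scores(["cat", "dog", "cat"], ["cat", "dog", "bird"], toy_score)
print(avg)
```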

A.3 Training dataset details

We train the Flamingo models on a carefully chosen mixture of datasets, illustrated in Figure 9 and described next.

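A.3.1 $M3W$ collection

The selection and scraping of web pages for $M3W$ follows a process similar to the one used to collect the MassiveWeb dataset [86]. We start by filtering out non-English documents, and we also remove those that do not pass internal filters which identify explicit content across images, videos, and text.
We then use a custom scraper to extract content from the remaining documents in the form of plain text interleaved with images, as described in Section 2.4.
The text in $M3W$ is collected in a similar fashion to that of MassiveWeb, but we also collect any images present at the same level in the HTML tree. We discard documents for which the scraping process does not yield any images.

We then remove repetitive and low-quality documents to improve text quality, and we apply image filters that remove images meeting any of the following conditions:

the image is less than 64 pixels wide or tall,
the image is too wide or too narrow (aspect ratio of 3 or more),
the image is obviously of low quality, e.g. single-colored.

Finally, we discard documents that no longer contain any images after this filtering step.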
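The image filters listed above can be sketched as a simple predicate. How single-colored images are detected is not described, so it is passed in as a precomputed flag here. The exact handling of the boundary at an aspect ratio of 3 is also an assumption. The names `keep_image` and `filter_document` are hypothetical.

```python
def keep_image(width, height, is_single_color=False):
    """Return True if the image passes all three filters."""
    if width < 64 or height < 64:
        return False  # too small
    if max(width, height) / min(width, height) >= 3:
        return False  # too wide or too narrow (boundary handling is an assumption)
    if is_single_color:
        return False  # obviously low quality
    return True

def filter_document(images):
    """images: list of (width, height, is_single_color). Returns kept images,
    or None when nothing survives and the document should be discarded."""
    kept = [img for img in images if keep_image(*img)]
    return kept or None

doc = [(640, 480, False), (63, 500, False), (1000, 300, False), (128, 128, True)]
print(filter_document(doc))
```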

A.3.2 $M3W$ image-placement augmentation

When evaluating Flamingo models, we prompt the model with an image and ask it to generate text for that image. This naturally leads to an ordering at inference time in which the image comes first, followed by its text.

However, the correspondence between images and text in our interleaved M3W dataset (Section 2.4) is in general unknown, and in some cases potentially not well-defined.
As a motivating example, a simple webpage might be structured in either of the following ways:
(a) This is my dog! <dog image> This is my cat! <cat image>
(b) <dog image> That was my dog! <cat image> That was my cat!

Ideally, the image index associated with a piece of text would point to the most semantically relevant image: the next image in (a) and the previous image in (b).
In the absence of a general way to determine semantic correspondence on real webpages, we make a simplifying assumption: at any given position in the text, the most relevant image is either the last image appearing before it or the one immediately following it, and the index is chosen accordingly, as in the example structures above.

During training, for each webpage sampled, we decide with probability $p_{\text{next}} = \frac{1}{2}$ whether the text is paired with the preceding or the following image. This inevitably means we make the semantically unnatural choice about half of the time, e.g. in (a), associating the text "This is my cat!" with the dog image.
We ablate this choice in Section 3.3 and find a small advantage to setting $p_{\text{next}} = \frac{1}{2}$ over always choosing the previous image ($p_{\text{next}} = 0$) or always the next image ($p_{\text{next}} = 1$). This suggests that the randomization may act as a beneficial form of "data augmentation".
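The per-webpage choice can be sketched on example (a) as follows. The page representation as a list of ("text" | "image", content) pairs and the name `pair_text_with_images` are assumptions made for illustration. Real training operates on token positions rather than whole text chunks.

```python
import random

def pair_text_with_images(page, p_next=0.5, rng=None):
    """Pair every text chunk with the previous or the next image.

    A single random decision is made per page, as described above.
    """
    rng = rng or random.Random()
    use_next = rng.random() < p_next
    image_positions = [i for i, (kind, _) in enumerate(page) if kind == "image"]
    pairs = []
    for i, (kind, content) in enumerate(page):
        if kind != "text":
            continue
        if use_next:
            later = [j for j in image_positions if j > i]
            chosen = page[later[0]][1] if later else None
        else:
            earlier = [j for j in image_positions if j < i]
            chosen = page[earlier[-1]][1] if earlier else None
        pairs.append((content, chosen))
    return pairs

page_a = [("text", "This is my dog!"), ("image", "dog"),
          ("text", "This is my cat!"), ("image", "cat")]
always_next = pair_text_with_images(page_a, p_next=1.0)
always_prev = pair_text_with_images(page_a, p_next=0.0)
print(always_next)
print(always_prev)
```

On layout (a), always choosing the next image gives the natural pairing, while always choosing the previous image attaches "This is my cat!" to the dog image, which is the mismatch discussed above.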

A.3.3 LTIP and VTP: Visual data paired with text

Along with our interleaved image and text dataset, we use several paired vision and text web datasets for training. One dataset is ALIGN [50], composed of 1.8 billion images paired with alt-text. ALIGN is large, but noisy and limited to images. The images are often poorly described by the corresponding alt-text annotation. For this reason, we augment it with two datasets: LTIP (Long Text & Image Pairs) consists of 312 million images, and VTP (Video & Text Pairs) consists of 27 million short videos (approximately 22 seconds on average). Both datasets are paired with more descriptive captions. For instance, the average number of tokens of an ALIGN text description is 12.4 per image, while it is 20.5 for the LTIP dataset. The LTIP and VTP datasets were collected by crawling fewer than ten websites targeting high-quality and rich image descriptions. These single-image and single-video datasets are preprocessed analogously to the $M 3 W$ data preprocessing described previously, adding the <image> tag at the beginning of the sequence (immediately after <BOS>), and the <EOC> token after the text (before <EOS>). We deduplicated these datasets against all our benchmarks (against both the training and the evaluation sets) using image similarity, as detailed in Appendix A.3.4. Datasheets for LTIP and VTP are respectively given in Appendix F.2.1 and Appendix F.2.2.

	Requires model sharding	Frozen		Trainable		Total count
		Language	Vision	GATED XATTN-DENSE	Resampler
Flamingo-3B	$\times$	1.4 B	435 M	1.2 B (every)	194 M	3.2B
Flamingo-9B	$\times$	7.1 B	435 M	1.6B (every 4th)	194 M	9.3B
Flamingo	$\checkmark$	70 B	435 M	10B (every 7th)	194 M	80B

Table 5: Parameter counts for Flamingo models. We focus on increasing the parameter count of the frozen LM and the trainable vision-text GATED XATTN-DENSE modules while maintaining the frozen vision encoder and trainable Resampler to a fixed and small size across the different models. The frequency of the GATED XATTN-DENSE with respect to the original language model blocks is given in parentheses.
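As a quick sanity check on Table 5, the per-component counts can be summed. The values below are copied from the table in billions, and the dictionary layout is an assumption of this example. The trainable total (GATED XATTN-DENSE plus Resampler) also appears consistent with the "additional learned parameters" figures quoted in Appendix B.1.1, though that reading is an inference.

```python
# Parameter counts in billions, copied from Table 5.
params_b = {
    "Flamingo-3B": {"language": 1.4, "vision": 0.435, "xattn_dense": 1.2, "resampler": 0.194},
    "Flamingo-9B": {"language": 7.1, "vision": 0.435, "xattn_dense": 1.6, "resampler": 0.194},
    "Flamingo": {"language": 70.0, "vision": 0.435, "xattn_dense": 10.0, "resampler": 0.194},
}

totals = {name: sum(parts.values()) for name, parts in params_b.items()}
trainable = {name: parts["xattn_dense"] + parts["resampler"] for name, parts in params_b.items()}

for name in params_b:
    print(f"{name}: total ~{totals[name]:.1f}B, trainable ~{trainable[name]:.1f}B")
```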

A.3.4 Dataset deduplication against evaluation tasks

We used an internal deduplication tool to deduplicate our training datasets from our evaluation datasets. This deduplication pipeline relies on a trained visual encoder that maps embeddings closer together when the corresponding images are potential duplicates. Once the image embeddings have been computed, a fast approximate nearest neighbor search is performed on the training images to retrieve duplicate candidates from the validation datasets. For the paired image-text dataset, we have deduplicated our LTIP and ALIGN training images against: ImageNet (train, val), COCO (train, valid, test), OK-VQA (train, valid, test), VQAv2 (train, valid, test), Flickr30k (train, valid, test), VisDial (train, valid, test). We did not deduplicate our image datasets against VizWiz, HatefulMemes and TextVQA as we performed these evaluations only after having trained our Flamingo models. However, we believe this had no impact on our results as the images from these datasets are unlikely to be scraped from the web: VizWiz images were obtained using a specific mobile app and are only available for download, HatefulMemes memes were created by researchers instead of being scraped from the web, and TextVQA images are from OpenImages.

Note that we did not run the deduplication on the $M 3 W$ dataset as one training example is a full webpage of interleaved paragraph with several images, unlikely to contain images from our benchmark suite. To verify this hypothesis, we have obtained near-duplicate statistics on the 185 M individual images from $M 3 W$ and the results are the following: in total, 1314 potential duplicates were found from the validation and test splits of ImageNet, COCO, OK-VQA, VQAv2, Flickr30k and VisDial. Out of the 1314 candidates, only 125 are exact duplicates. For the video datasets, we did not perform any deduplication of VTP ( 27 M videos) as none of the collected VTP videos were obtained from YouTube or Flickr, which are the sources of all of our video evaluation datasets collected on the Internet.

B Experiments

B.1 Training and evaluation details

B.1.1 Models

We perform experiments across three model sizes, where we scale the frozen language model from 1.4B to 7B and 70B, and adapt the parameter count of other components accordingly. We keep the pretrained vision encoder frozen across all experiments and use an NFNet-F6 model trained contrastively (see Appendix B.1.3), unless explicitly stated otherwise in the ablation study. We use a Perceiver Resampler with approximately 200M parameters across all three model sizes. The decision on how many GATED XATTN-DENSE layers to interleave is mainly driven by a trade-off between memory constraints and downstream performance. We identified the optimal trade-off at small model scales, before transferring our findings to the large model architecture. We obtain three models, Flamingo-3B, Flamingo-9B and Flamingo-80B, detailed below:

The Flamingo-3B model builds on top of a $\mathbf{1.4B}$ frozen language model from [42]. Before each transformer block, we add a GATED XATTN-DENSE layer attending to the visual inputs; this accounts for 1.4B additional learned parameters.
The Flamingo-9B model builds on top of a $\mathbf{7B}$ frozen language model from [42]. Starting from the very first layer and before every fourth transformer block, we add a GATED XATTN-DENSE layer attending to the visual inputs; this accounts for 1.8B additional learned parameters.
The Flamingo-80B model builds on top of the frozen Chinchilla $\mathbf{70B}$ language model [42]. Starting from the very first layer and before every seventh transformer block, we add a GATED XATTN-DENSE layer attending to the visual inputs; this accounts for 10B additional learned parameters. For simplicity, we refer to this model as Flamingo throughout the paper.

In Table 5 we report the parameter count of each component of our models, as well as model sharding requirements. We provide more Transformer architecture details in Appendix A.1.4. The Flamingo model card [77] is also given in Appendix E.

B.1.2 Training details for the Flamingo models

Data augmentation and preprocessing. Empirically we find that it is effective to stochastically prepend the paired dataset text samples with a single space character, with probability 0.5 . We attribute this to the fact that our subword tokenizer maps the beginning of various words to a different token depending on whether it is preceded by a space. This allows us to enforce invariance to this tokenizer artifact, without degrading significantly correctness of the punctuation which is already lacking in many of these samples. We observe that this leads to substantial improvement across tasks. The visual inputs are resized to $320 \times 320$ while preserving their aspect ratios, padding the image with the mean value if required. Note that this is higher than the $288 \times 288$ resolution used for the contrastive pretraining of our Vision Encoder (see Appendix B.1.3). The increase in resolution during the final stage training was motivated by [113] showing one can obtain improved performance at a higher test-time resolution when using CNNs. This increase in resolution also comes with only a moderate computational and memory cost as no backpropagation is performed through the frozen Vision Encoder. We also employ random left/right flips and color augmentation. For interleaved datasets (Section 2.4) we also employ augmentation by lightly randomizing the selected image indices $\phi$ with a hyperparameter $p_{\text {next }}$ when sampling examples from the $M 3 W$ dataset. This augmentation is detailed in Appendix A.3.2 and our choice of $p_{\text {next }}=\frac{1}{2}$ is ablated in Appendix B.3.1. For video training, we temporally sample a clip of 8 frames sampled at one frame per second (fps) from each training video. Although our model was trained with a fixed number of 8 frames, at inference time, we input 30 frames at 3 FPS. This is achieved by linearly interpolating the learnt temporal position embedding of the Perceiver Resampler at inference time.
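The last step above, going from 8 learned temporal position embeddings to 30 at inference time, can be sketched with plain linear interpolation. The text does not specify how positions are aligned, so mapping the first and last learned embeddings onto the first and last inference frames is an assumption. `interpolate_embeddings` is a hypothetical helper and the embedding values are made up.

```python
def interpolate_embeddings(emb, new_len):
    """Linearly resample a list of T vectors to new_len vectors.

    Assumption: endpoints are aligned (first maps to first, last to last).
    """
    old_len = len(emb)
    if new_len == 1 or old_len == 1:
        return [emb[0][:] for _ in range(new_len)]
    out = []
    for k in range(new_len):
        pos = k * (old_len - 1) / (new_len - 1)  # fractional index into emb
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(emb[lo], emb[hi])])
    return out

learned = [[float(t), 2.0 * t] for t in range(8)]   # 8 training-time embeddings
at_inference = interpolate_embeddings(learned, 30)  # 30 inference-time embeddings
print(len(at_inference), at_inference[0], at_inference[-1])
```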

Loss and optimisation. All our models are trained using the AdamW optimizer with global norm clipping of 1, no weight decay for the Perceiver Resampler and weight decay of 0.1 for the other trainable parameters. The learning rate is increased linearly from 0 up to $10^{-4}$ over the first 5000 steps, then held constant for the duration of training (no improvements were observed from decaying the learning rate). Unless specified otherwise we train our models for $500k$ steps. Four datasets are used for training: $M3W$, ALIGN, LTIP and VTP, with weights $\lambda_{m}$ of 1.0, 0.2, 0.2 and 0.03 respectively. These weights were obtained empirically at a small model scale and kept fixed afterwards. Batch sizes depend on the setting and are given in the next sections.
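The learning-rate schedule described above (linear warmup, then constant) is simple enough to write down directly. The function name and argument names are illustrative.

```python
def learning_rate(step, peak=1e-4, warmup_steps=5000):
    """Linear warmup from 0 to `peak` over `warmup_steps`, then constant."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak

for step in (0, 2500, 5000, 500_000):
    print(step, learning_rate(step))
```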

Infrastructure and implementation. Our model and associated infrastructure were implemented using JAX [8] and Haiku [40]. All training and evaluation was performed on TPUv4 instances. The largest model containing 80 billion parameters is trained on 1536 chips for 15 days and sharded across 16 devices. Megatron type sharding [99] is used to enable 16-way model parallelism for all Embedding / Self-Attention / Cross-Attention / FFW layers, while the NFNet vision layers were unsharded. ZeRO stage 1 [88] is used to shard the optimizer state. All trained parameters and optimizer accumulators are stored and updated in float32; all activations and gradients are computed in bfloat16 after downcasting of parameters from float32 to bfloat16. Frozen parameters are stored and applied in bfloat16.

B.1.3 Contrastive model details

The vision encoder is trained from scratch, together with a language encoder. Using these encoders, images and text pairs are separately encoded and projected to a shared embedding space and L2 normalized. From these embeddings, we maximize the similarity of paired embeddings and minimize the similarity of unpaired embeddings, using a multi-class cross-entropy loss, where the paired image-texts are treated as positive examples and the rest of the batch as negative examples. We use the same loss as in CLIP [85], which consists of two contrastive losses, one from text to image and the other from image to text. We use a learnable temperature parameter in the final log-softmax layer [9]. The text-to-image loss is as follows:

$$L_{\text{contrastive:txt2im}} = -\frac{1}{N} \sum_{i}^{N} \log \left( \frac{\exp\left(L_{i}^{\top} V_{i} \beta\right)}{\sum_{j}^{N} \exp\left(L_{i}^{\top} V_{j} \beta\right)} \right)$$

And the image-to-text loss is defined analogously:

$$L_{\text{contrastive:im2txt}} = -\frac{1}{N} \sum_{i}^{N} \log \left( \frac{\exp\left(V_{i}^{\top} L_{i} \beta\right)}{\sum_{j}^{N} \exp\left(V_{i}^{\top} L_{j} \beta\right)} \right)$$

The sum of the two losses is minimized. Here, $V_{i}$ and $L_{i}$ are, respectively, the normalized embedding of the vision and language component of the $i$ -th element of a batch. $\beta$ is a trainable inverse temperature parameter and $N$ is the number of elements in the batch. We use the BERT [23] architecture for the language encoder. The outputs of the language and vision encoders are meanpooled (across tokens and spatial locations, respectively) before being projected to the shared embedding space. We only use the weights from the contrastive vision encoder in the main Flamingo models.
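The two losses can be evaluated on a toy batch with a few lines of pure Python. This sketch assumes the embeddings are already L2-normalized and omits label smoothing, so it is not the full training objective. The function names are hypothetical.

```python
import math

def log_softmax_at(logits, i):
    """log(exp(logits[i]) / sum_j exp(logits[j])), computed stably."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[i] - lse

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(V, L, beta):
    """Sum of the text-to-image and image-to-text losses for a batch.

    V, L: lists of normalized vision / language embeddings; beta: inverse temperature.
    """
    n = len(V)
    txt2im = -sum(log_softmax_at([beta * dot(L[i], V[j]) for j in range(n)], i)
                  for i in range(n)) / n
    im2txt = -sum(log_softmax_at([beta * dot(V[i], L[j]) for j in range(n)], i)
                  for i in range(n)) / n
    return txt2im + im2txt

V = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(V, [[1.0, 0.0], [0.0, 1.0]], beta=10.0)
mismatched = contrastive_loss(V, [[0.0, 1.0], [1.0, 0.0]], beta=10.0)
print(aligned, mismatched)
```

Correctly paired embeddings give a loss close to zero, while swapping the text embeddings makes it large, which is the behaviour the objective is meant to reward.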

The vision encoder is pretrained on the ALIGN and LTIP datasets. The training image resolution is $288 \times 288$, the joint embedding space is of size 1376 and the batch size is 16,384. It is trained for 1.2 million parameter update steps, each of which consists of two gradient calculation steps (more details below), on 512 TPUv4 chips. The learning rate is decayed linearly from $10^{-3}$ to zero over the course of training. Images have random color augmentation and horizontal flips applied during training. We use the tokenizer employed by Jia et al. [50]. The Adam optimizer is used to optimize the network, and we apply label smoothing of 0.1. We apply $10^{-2}$ adaptive gradient clipping (AGC) [10] to the NFNet encoder and global norm gradient clipping of 10 for the BERT encoder.

To evaluate the pretrained model, we track zero-shot image classification and retrieval. For zero-shot image classification, we use image-text retrieval between the images and the class names. Following Radford et al. [85] we use "prompt-ensembling" in which we embed multiple texts using templates such as "A photo of a {class_name}" and average the resulting embedding.

B.1.4 Evaluation benchmarks

Our goal is to develop models that can rapidly adapt to diverse and challenging tasks in the few-shot setting. For this, we consider a wide array of popular image and video benchmarks summarized in Table 6. In total we chose 16 multimodal image/video and language benchmarks, spanning tasks that require some language understanding (visual question answering, captioning, visual dialogue) as well as two standard image and video classification benchmarks (ImageNet and Kinetics). Note that for the video datasets collected from YouTube (i.e., all video datasets except NextQA and STAR), we evaluated our model on all the publicly available video as of April 2022.

DEV benchmarks. In order to validate design decisions of our model over the course of the project, we selected five benchmarks from the 16 multimodal image/video and language benchmarks as well as ImageNet and Kinetics for classification as our development set (referred to as DEV). To maximise its relevance, we choose the most challenging and widely studied benchmarks for captioning, visual question-answering and classification tasks on both images and videos.

Dataset splits for the DEV benchmarks. Concretely, estimating few-shot learning performance of a model consists of adapting it on a set of support samples and evaluating it on a set of query samples. As a result, any evaluation set should be composed of two disjoint subsets containing respectively the support and the query samples. For the DEV benchmarks that are used both to validate design decisions and hyperparameters, as well as to report final performance, we therefore use four subsets:

validation support: contains support samples for validation;
validation query: contains query samples for validation;
test support: contains support samples for final performance estimation;
test query: contains query samples for final performance estimation.

	Dataset	DEV	Gen.	Custom prompt	Task description	Eval set	Metric
Image	ImageNet-1k [94]	$\checkmark$			Object classification	Val	Top-1 acc.
	MS-COCO [15]	$\checkmark$	$\checkmark$		Scene description	Test	CIDEr
	VQAv2 [3]	$\checkmark$	$\checkmark$		Scene understanding QA	Test-dev	VQA acc. [3]
	OKVQA [69]	$\checkmark$	$\checkmark$		External knowledge QA	Val	VQA acc. [3]
	Flickr30k [139]		$\checkmark$		Scene description	Test (Karpathy)	CIDEr
	VizWiz [35]		$\checkmark$		Scene understanding QA	Test-dev	VQA acc. [3]
	TextVQA [100]		$\checkmark$		Text reading QA	Val	VQA acc. [3]
	VisDial [20]				Visual Dialogue	Val	NDCG
	HatefulMemes [54]			$\checkmark$	Meme classification	Seen Test	ROC AUC
Video	Kinetics700 2020 [102]	$\checkmark$			Action classification	Val	Top-1/5 avg
	VATEX [122]	$\checkmark$	$\checkmark$		Event description	Test	CIDEr
	MSVDQA [130]	$\checkmark$	$\checkmark$		Event understanding QA	Test	Top-1 acc.
	YouCook2 [149]		$\checkmark$		Event description	Val	CIDEr
	MSRVTTQA [130]		$\checkmark$		Event understanding QA	Test	Top-1 acc.
	iVQA [135]		$\checkmark$		Event understanding QA	Test	iVQA acc. [135]
	RareAct [73]			$\checkmark$	Composite action retrieval	Test	mWAP
	NextQA [129]		$\checkmark$		Temporal/Causal QA	Test	WUPS
	STAR [128]				Multiple-choice QA	Test	Top-1 acc.

Table 6: Summary of the evaluation benchmarks. DEV benchmarks were used to validate general design decisions of the Flamingo models. Gen. stands for generative task where we sample text from the VLM. If a task is non-generative it means that we use the VLM to score answers among a given finite set. For most of our tasks we use a common default prompt, hence minimizing task-specific tuning (see Appendix B.1.5).

In practice, for the test query subset, we use the subset that prior works report results on, for apples-to-apples comparison. While the validation set would be a natural choice for the validation query subset, we note that this is not possible for all benchmarks, since some benchmarks do not have an official validation set (e.g. OKVQA) and for others, the validation is commonly used to report final performance in place of the test set (e.g. ImageNet or COCO). For simplicity, we use a subset of the original training set as the validation query subset. Finally, we also use additional disjoint subsets of the training set as respectively the validation support subset and the test support subset.

We now describe in more detail how we form the latter three subsets. For captioning tasks, open-ended evaluation is efficient so we evaluate on a large number of samples. Specifically, for COCO, we use the same number of samples as used in the Karpathy splits for evaluation sets (5000). For VATEX, because the training set is of limited size, we only evaluate over 1024 samples, reserving the rest for support sets. For question-answering tasks, we evaluate over 1024 samples; chosen to make both open- and close-ended evaluation reasonably fast. For image classification tasks, we evaluate over 10 images per class: 10,000 samples for ImageNet, and 7000 samples for Kinetics700. As for the support sets, for both validation and final performance estimation, we use 2048 samples across all tasks, except for classification tasks where we scale this to 32 samples per class, to better estimate expected performance for each class.

Unbiased few-shot performance estimation. Few-shot learning performance estimates on the DEV benchmarks may be biased, in the sense that over the course of this project, design decisions were made based on the performance obtained on these benchmarks. We note that this is the case for prior work which also make use of these benchmarks to validate and ablate their own design decisions. To account for this bias and provide unbiased few-shot learning performance estimates, we report performance on a remaining set of 11 benchmarks. Among those, some span the same open-ended image and video tasks as our DEV benchmarks (captioning and visual question-answering). But we also look at more specific benchmarks in order to explore less explored capabilities. These notably include: TextVQA [100] which specifically assesses OCR capabilities through question-answering; VisDial [20], a visual dialogue benchmark; HatefulMemes [54] a vision and text classification benchmark; NextQA [129] which specially focuses on causality and temporal relation; STAR [128], a multiple-choice question answering task; and RareAct [73], a benchmark measuring compositionality in action recognition. We emphasize that we do not validate any design decisions on these benchmarks and use them solely to estimate unbiased few-shot learning performance after Flamingo training is done.

B.1.5 Few-shot learning evaluation hyperparameters

In few-shot learning, hyperparameter selection implicitly increases the number of shots as it requires additional validation examples. If those are not taken into account, as is often the case in practice, few-shot performance can be overestimated [80]. Similarly, cross-validation of benchmark-specific hyperparameters such as the prompt should be considered as a particularly basic few-shot learning method, where one selects the task-specific prompt over the set of shots. But other learning approaches might be more effective in making use of these labelled examples. Given the negative results reported by [80] in terms of the robustness of cross-validation and unless mentioned otherwise, all benchmarks are run using a single set of evaluation hyperparameters, including the prompts. We optimize hyperparameters jointly across the validation subsets of the DEV benchmarks and do not perform any benchmark-specific cross-validation of hyperparameters, aside from a few exceptions, as we detail next.

Except for HatefulMemes and RareAct, we always use the prompt "Output: {output}" for all non-question-answering tasks, and "Question: {question} Answer: {answer}" for all question-answering / visual dialogue tasks. In particular, for VisDial [20], we use the previously described prompt to encode each question/answer in the dialogue and the provided image caption is prepended to the dialogue history without any prompt. For HatefulMemes [54], we use a specific prompt to incorporate the OCR information provided as input which is: "is an image with written: "{meme_text}" on it. Is it hateful? Answer: {answer}", where the answer is either yes or no. Note that this is the only dataset where we explicitly provide OCR text "meme_text" as input to Flamingo models. For TextVQA, we do not make use of the provided OCR transcripts and instead directly rely on the off-the-shelf OCR capabilities of the Flamingo models. For RareAct, a zero-shot benchmark, we change the verb names to the third person, add an article before each noun and use the prompt "Caption: a person {verb + object}".

B.1.6 Dialogue prompt

For the dialogue examples in Figures 1 and 11, the following prompt is used:

This is a conversation between a human, User, and an intelligent
visual AI, Flamingo. User sends images, and Flamingo describes
them.
User: <a cat image>
Flamingo: That is a cat. It's a tiny kitten with really cute big
ears.
User: <a dinner image>
Flamingo: This is a picture of a group of people having dinner.
They are having a great time!
User: Can you guess what are they celebrating?
Flamingo: They might be celebrating the end of a successful
project or maybe a birthday?
User: <a graph image>
Flamingo: This is a graph, it looks like a cumulative density
function graph.

Model	Method	Prompt size	shots/class	ImageNet top 1	Kinetics700 avg top1/5
SotA	Fine-tuned	-	full	90.9 [127]	89.0 [134]
SotA	Contrastive	-	0	85.7 [82]	69.6 [85]
NFNetF6	Our contrastive	-	0	77.9	62.9
Flamingo-3B	RICES	8	1	70.9	55.9
		16	1	71.0	56.9
		16	5	72.7	58.3
Flamingo-9B	RICES	8	1	71.2	58.0
		16	1	71.7	59.4
		16	5	75.2	60.9
Flamingo-80B	Random	16	$\leq 0.02$	66.4	51.2
	RICES	8	1	71.9	60.4
		16	1	71.7	62.7
		16	5	76.0	63.5
	RICES+ensembling	16	5	77.3	64.2

Table 7: Few-shot results on classification tasks. The Flamingo models can also be used for standard classification tasks. In particular, we explore having access to support sets bigger than what our current prompt can accommodate (using up to 5000 support examples). In that regime, large gains are obtained by using the RICES method [136] as well as prompt ensembling. We also observe the same trend as with the vision-language benchmarks: bigger models do better and more shots help.

B.2 Additional performance results

B.2.1 Few-shot learning on classification tasks

We consider applying the Flamingo models to well-studied classification benchmarks like ImageNet or Kinetics700. Results are given in Table 7. First, we observe a pattern similar to that in our other experiments: larger models tend to perform better. Second, given that few-shot classification tasks often come with more training examples (e.g., 1000 for ImageNet with 1 example per class), using methods to scale to larger support sets is beneficial. RICES (Retrieval In-Context Example Selection [136], described in Appendix A.2) performs substantially better than simply selecting examples randomly for inclusion in the prompt. Indeed, Flamingo achieves a $9.2\%$ improvement in ImageNet classification when selecting 16 support examples out of 5000 using RICES, compared to choosing the same number of examples randomly. Ensembling multiple prompts further boosts results. However, note that Flamingo models underperform the current dominant contrastive paradigm for classification tasks; in particular, they underperform the very contrastive model used as their vision encoder (see Appendix D.1 on Flamingo's limitations for more details). Finally, state-of-the-art zero-shot models on ImageNet such as BASIC [82] and LiT [146] are particularly optimized on classification tasks as they are trained on JFT-3B [145], a dataset with images and labels. Improving the performance of VLMs such as Flamingo on classification tasks is an interesting direction for future work.
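As a rough illustration of retrieval-based example selection, the sketch below picks the support examples whose features are most similar to the query and places the most similar one last, right before the query. This is a minimal sketch under stated assumptions: the cosine similarity, the ordering by increasing similarity and the names `cosine` and `select_support` are our own illustrative choices, and the feature vectors stand in for embeddings from a frozen vision encoder. The procedure actually used is the one described in Appendix A.2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_support(query_feat, support, k):
    """Pick the k support examples most similar to the query.

    `support` is a list of (feature_vector, label) pairs. The result is
    ordered by increasing similarity, so the closest example comes last
    (an assumption of this sketch).
    """
    ranked = sorted(support, key=lambda ex: cosine(query_feat, ex[0]), reverse=True)
    return list(reversed(ranked[:k]))


support_set = [([1.0, 0.0], "cat"), ([0.0, 1.0], "car"), ([0.9, 0.1], "kitten")]
chosen = select_support([1.0, 0.0], support_set, k=2)
labels = [label for _, label in chosen]
```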

B.2.2 Fine-tuning Flamingo as a pretrained vision-language model

To fine-tune Flamingo models on a downstream task, we train them on data batches from the task of interest in the same format as the single-image/video datasets described in Section 2.4.

Freezing and hyperparameters. When fine-tuning Flamingo, we keep the underlying LM layers frozen and train the same Flamingo layers as during pretraining. We also increase the resolution of the input images from $320 \times 320$ to $480 \times 480$ . Unlike in the pretraining phase, we also fine-tune the base visual encoder, finding that this typically improves results, likely due in part to the higher input resolution.

We choose certain hyperparameters on a per-task basis by grid search on a validation subset of the training set (or on the official or standard validation set where available). These hyperparameters include the learning rate (ranging from $3 \times 10^{-8}$ to $1 \times 10^{-5}$ ) and decay schedule (exponential decay by factors of $10 \times$ ), number of training steps, batch size (either 8 or 16), and whether visual data augmentation (color augmentation, random horizontal flips) is used.

Results. In Table 8, we present additional results for per-task Flamingo fine-tuning. When provided access to a large-scale task-specific dataset with many thousands of examples, we find that we can improve results over our previously presented in-context few-shot learning results, setting a new state of the art on five tasks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes. For example, on VQAv2, we observe improved results at $82.0\%$, outperforming our results achieved with 32-shot in-context learning ($67.3\%$) as well as the previous state of the art ($81.3\%$, Yan et al. [133]).

Although these fine-tuning results come at high computational cost relative to the previously presented in-context few-shot learning results - among other challenges like hyperparameter tuning - they further demonstrate the power of VLM pretraining for visual understanding even in the presence of large amounts of task-specific training data. In some cases our results likely trail the state of the art due in part to the fact that we simply optimise log-likelihood and do not make use of common task-specific metric optimisation tricks, such as CIDEr optimisation [64, 90] for COCO captioning, and fine-tuning on dense annotations for VisDial [79]. For example, Murahari et al. [79] report a $10 \%$ relative improvement in NDCG on VisDial from such dense annotation fine-tuning.

B.2.3 Zero-shot performance of the pretrained contrastive model

A crucial part of our approach is the Vision Encoder, pretrained separately using contrastive learning and kept frozen when training Flamingo models. We report zero-shot image classification results on ImageNet and Kinetics700, and retrieval results on Flickr30K and COCO. The classification results are presented in Table 7 while the retrieval results are given in Table 9. For the retrieval tasks, our model outperforms the current state-of-the-art contrastive dual encoder approaches CLIP [85], ALIGN [50] and Florence [140]. However, we underperform the zero-shot state-of-the-art on Kinetics700 (CLIP) and the zero-shot state-of-the-art on ImageNet (BASIC). As noted earlier, BASIC [82] is particularly optimized for classification: it is trained on the JFT-3B [145] dataset, which has images with labels rather than captions. We have noticed that training on images paired with short text descriptions similar to labels significantly helps for ImageNet but is detrimental for retrieval benchmarks, which require capturing rich scene descriptions instead. Since our goal is to use the Vision Encoder as a feature extractor for the Flamingo models in order to capture the whole scene and not just the main object, we favor retrieval metrics over classification ones. We provide more details about the contrastive pretraining in Appendix B.1.3.

Table 8: Comparison to SotA when fine-tuning Flamingo. We fine-tune Flamingo on all nine tasks where Flamingo was not SotA with few-shot learning. Flamingo sets a new SotA on five of these tasks, sometimes even beating methods that resort to known performance optimization tricks such as model ensembling (on VQAv2, VATEX, VizWiz and HatefulMemes). Best numbers among the restricted SotA are in bold. Best numbers overall are underlined. Restricted $\mathrm{SotA}^{\dagger}$ only includes methods that use a single model (not ensembles) and do not directly optimise the test metric (no CIDEr optimisation).

	VQAv2		COCO	VATEX	VizWiz		MSRVTTQA	VisDial		YouCook2	TextVQA		HatefulMemes
Method	test-dev	test-std	test	test	test-dev	test-std	test	valid	test-std	valid	valid	test-std	test seen
Flamingo - 32 shots	67.6	-	113.8	65.1	49.8	-	31.0	56.8	-	86.8	36.0	-	70.0
SimVLM [124]	80.0	80.3	143.3	-	-	-	-	-	-	-	-	-	-
OFA [119]	79.9	80.0	149.6	-	-	-	-	-	-	-	-	-	-
Florence [140]	80.2	80.4	-	-	-	-	-	-	-	-	-	-	-
Flamingo Fine-tuned	$\underline{82.0}$	$\underline{82.1}$	138.1	84.2	65.7	65.4	47.4	61.8	59.7	118.6	57.1	54.1	86.6
Restricted SotA$^{\dagger}$	80.2 [140]	80.4 [140]	143.3 [124]	76.3 [153]	--	--	46.8 [51]	$\underline{75.2}$ [79]	74.5 [79]	138.7 [132]	54.7 [137]	73.7 [84]	79.1 [62]
Unrestricted SotA	81.3 [133]	81.3 [133]	149.6 [119]	81.4 [153]	57.2 [65]	60.6 [65]	--	--	$\underline{75.4}$ [123]	--	--	--	84.6 [152]

	Flickr30K						COCO
	image-to-text			text-to-image			image-to-text			text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
Florence [140]	90.9	99.1	-	76.7	93.6	-	64.7	85.9	-	47.2	71.4	-
ALIGN [50]	88.6	98.7	99.7	75.7	93.8	96.8	58.6	83.0	89.7	45.6	69.8	78.6
CLIP [85]	88.0	98.7	99.4	68.7	90.6	95.2	58.4	81.5	88.1	37.7	62.4	72.2
Ours	89.3	98.8	99.7	79.5	95.3	97.9	65.9	87.3	92.9	48.0	73.3	82.1

Table 9: Zero-shot contrastive pretraining evaluation. Zero-shot image-text retrieval evaluation of our pretrained contrastive model compared to the state-of-the-art dual encoder contrastive models.

	Ablated setting	Flamingo 3B value	Changed value	Param. count $\downarrow$	Step time $\downarrow$	COCO CIDEr $\uparrow$	OKVQA top1 $\uparrow$	VQAv2 top1 $\uparrow$	MSVDQA top1 $\uparrow$	VATEX CIDEr $\uparrow$	Overall score $\uparrow$
Flamingo 3B model (short training)				3.2B	1.74s	86.5	42.1	55.8	36.3	53.4	70.7
(i)	Resampler size	Medium	Small	3.1B	1.58s	81.1	40.4	54.1	36.0	50.2	67.9
			Large	3.4B	1.87s	84.4	42.2	54.4	35.1	51.4	69.0
(ii)	Multi-Img att.	Only last	All previous	3.2B	1.74s	70.0	40.9	52.0	32.1	46.8	63.5
(iii)	$p_{\text{next}}$	0.5	0.0	3.2B	1.74s	85.0	41.6	55.2	36.7	50.6	69.6
			1.0	3.2B	1.74s	81.3	43.3	55.6	36.8	52.7
(iv)	LM pretraining	MassiveText	C4	3.2B	1.74s	81.3	34.4	47.1	60.6	53.9	62.8
(v)	Freezing Vision	$\checkmark$	$\times$ (random init)	3.2B	4.70s*	74.5	41.6	52.7	31.4	35.8	61.4
			$\times$ (pretrained)	3.2B	4.70s*	83.5	40.6	55.1	34.6	50.7	68.1
(vi)	Co-train LM on MassiveText	$\times$	$\checkmark$ (random init)	3.2B	5.34s*	69.3	29.9	46.1	28.1	45.5	55.9
			$\checkmark$ (pretrained)	3.2B	5.34s*	83.0	42.5	53.3	35.1	51.1	68.6
(vii)	Dataset and Vision encoder	M3W+ITP+VTP and NFNetF6	LAION400M and CLIP	3.1B	0.86s	61.4	37.9	50.9	27.9	29.7	54.7
			M3W+LAION400M+VTP and CLIP	3.1B	1.58s	76.3	41.5	53.4	32.5	46.1	64.9

Table 10: Additional ablation studies. Each row in this ablation study table should be compared to the baseline Flamingo run reported at the top of the table. The step time measures the time spent to perform gradient updates on all training datasets. (*): Due to higher memory usage, these models were trained using four times more TPU chips. The obtained accumulation step time was therefore multiplied by four.

B.3 Extended ablation studies

B.3.1 Flamingo

Ablation study experimental setup. In Table 10, we report per-task results and the Overall score (see Section 3.3) for Flamingo-3B on the validation subsets of the 5 DEV multimodal benchmarks with 4 shots. We perform the ablation using a batch size of 256 for M3W, 512 for ALIGN, 512 for LTIP and 64 for VTP. Models are trained for 1 million gradient steps (meaning 250,000 gradient updates for the base model, as we accumulate gradients over four datasets).

Resampler size. We further investigate the architectural design of the Resampler in row (i) of Table 10. We ablate the size of our Resampler with three options: Small, Medium (default value for all Flamingo models), and Large. We see that the best performance is achieved with a medium size Resampler. Moreover, when scaled together with the frozen LM, we observed that increasing the size of the Perceiver Resampler led to unstable training. We thus made a conservative choice to keep the same medium Resampler size for all our Flamingo models.

Effect of how many images are cross-attended to. In the interleaved image-text scenario, we ablate whether the model can only attend to the single most recent previous image, or to all the previous images (row (ii) of Table 10). We can see that the single image case leads to significantly better results ( $7.2 \%$ better in the overall score). One potential explanation is that when attending to all previous images, there is no explicit way of disambiguating between different images in the cross-attention inputs. Nonetheless, recent work has shown that such disambiguation is still possible implicitly through the causal attention mechanism [36]. We also explored more explicit ways to enable this while attending to all previous images by modifying the image tags to include an index (<image 1>, <image 2>, etc.) and/or learning absolute index embeddings added to the cross-attention features for each image. These strategies were not as robust as our method when the number of images per sequence changes between training and test time. Such a property is desirable to reduce the number of images per sequence during training for better efficiency (we use $N=5$ at training time) while still generalizing to many images for few-shot evaluation (we go up to $N=32$ at test time). For these reasons, we keep the single image cross-attention strategy for the Flamingo models. Note that while the model cannot explicitly attend to all previous images due to this masking strategy, it can still implicitly attend to them from the language-only self-attention that propagates all previous images' features via the previous text tokens.
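The single-image strategy can be pictured as a per-token lookup of the most recent image. The snippet below is a minimal sketch, assuming a tokenised sequence in which each image is marked by a hypothetical `"<image>"` tag; the helper names and the convention that the tag itself attends to its own image are illustrative assumptions, not details confirmed by the text.

```python
def last_image_index(tokens, image_tag="<image>"):
    """For each position, return the 0-based index of the most recent image
    at or before that position, or -1 if no image has appeared yet."""
    idx = -1
    out = []
    for tok in tokens:
        if tok == image_tag:
            idx += 1
        out.append(idx)
    return out


def cross_attention_mask(tokens, image_tag="<image>"):
    """Boolean mask[t][i]: may token t cross-attend to image i?

    Under the single-image strategy each token sees only the last image.
    """
    phi = last_image_index(tokens, image_tag)
    n_images = phi[-1] + 1 if phi else 0
    return [[i == p for i in range(n_images)] for p in phi]


sequence = ["<image>", "a", "cat", "<image>", "a", "dog"]
mask = cross_attention_mask(sequence)
```

Because the mask depends only on the position of the latest image, the same rule applies unchanged whether a sequence holds 5 images or 32, which is consistent with the robustness argument made above.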

M3W image placement data augmentation. Given a webpage, we don't know in advance if the text of the page will mention the previous or the next image in the two-dimensional layout of the page DOM. For this reason, we explore a data augmentation on M3W controlled by $p_{\text{next}}$, which indicates whether a given text token attends to the previous or the next image (see more details in Appendix A.3.2). The default value $p_{\text{next}}=\frac{1}{2}$ means that for each webpage sampled, we decide uniformly at random whether the model attends to the previous or next image. $p_{\text{next}}=0$ means the model always attends to the previous image while $p_{\text{next}}=1$ means the model always attends to the following image. The results (row (iii) of Table 10) show that using this randomization is beneficial.
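A possible reading of this augmentation is sketched below: one random draw per webpage decides whether every text token is paired with the preceding or the following image. The function name `text_token_targets`, the `"<image>"` tag and the use of -1 for "no image available" are assumptions made for illustration; the exact procedure is the one in Appendix A.3.2.

```python
import random


def text_token_targets(tokens, p_next, rng, image_tag="<image>"):
    """Return, for each text token, the index of the image it attends to.

    A single draw per page chooses between "previous image" and "next image".
    -1 means there is no such image (e.g. text before the first image when
    attending to the previous one).
    """
    attend_next = rng.random() < p_next  # one decision for the whole page
    n_images = sum(tok == image_tag for tok in tokens)
    seen = 0  # number of images encountered so far
    targets = []
    for tok in tokens:
        if tok == image_tag:
            seen += 1
            continue
        if attend_next:
            targets.append(seen if seen < n_images else -1)
        else:
            targets.append(seen - 1)
    return targets


page = ["intro", "<image>", "cat", "<image>", "dog"]
prev_targets = text_token_targets(page, p_next=0.0, rng=random.Random(0))
next_targets = text_token_targets(page, p_next=1.0, rng=random.Random(0))
```

With `p_next=0.5`, roughly half of the sampled pages would use each of the two assignments.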

Language model pretraining. To measure the importance of text pretraining, we compare the performance of using a frozen decoder-only Transformer either pretrained on MassiveText (our main model) or pretrained on the C4 dataset [87] (row (iv) of Table 10). Using the C4 dataset (which is smaller and less filtered than MassiveText) for training leads to a significant loss in performance ( $-7.9 \%$ overall). We note that the performance notably decreases for tasks that involve more language understanding such as visual question-answering tasks (OKVQA, VQAv2 and MSVDQA) while it remains on par for tasks that do not require as much language understanding (COCO, VATEX). This highlights the importance of pretraining the LM on a high-quality text-only dataset.

Freezing the vision encoder. During Flamingo training, we freeze the pretrained components (Vision Encoder and LM layers) while training newly added components from scratch. We ablate in (v) of Table 10 this freezing decision by training the Vision Encoder weights either from scratch or initialized with the contrastive vision-language task. If trained from scratch, we observe that the performance decreases by a large margin of $-9.3 \%$ . Starting from pretrained weights still leads to a drop in performance of $-2.6 \%$ while also increasing the compute cost of the training.
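The drops quoted in these ablation paragraphs can be read directly off the Overall score column of Table 10 as absolute differences from the baseline score of 70.7. The short check below recomputes them; the dictionary keys are descriptive labels chosen for this sketch.

```python
BASELINE_OVERALL = 70.7  # Flamingo 3B model (short training), Table 10

# Overall scores copied from Table 10.
OVERALL = {
    "(ii) attend to all previous images": 63.5,
    "(iv) LM pretrained on C4": 62.8,
    "(v) vision encoder trained from scratch": 61.4,
    "(v) vision encoder trained from pretrained weights": 68.1,
}


def delta(score, baseline=BASELINE_OVERALL):
    """Difference in Overall score relative to the baseline, in points."""
    return round(score - baseline, 1)


deltas = {name: delta(score) for name, score in OVERALL.items()}
for name, value in deltas.items():
    print(f"{name}: {value:+.1f}")
```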

Alternative to freezing the LM by co-training on MassiveText. Another approach for preventing catastrophic forgetting is to co-train on MassiveText [86], the dataset that was used to pretrain the language model. Specifically, we add MassiveText to the training mixture, with a weight $\lambda_{m}$ of 1.0 (best performing after a small grid search), using a sequence length of 2048 and the exact same setting as the pretraining of Chinchilla [42] for computing the text-only training loss. In order to co-train on MassiveText, we need to unfreeze the language model but we keep the vision encoder frozen. We perform two ablations in row (vi) of Table 10: starting from a pretrained language model (with a learning rate multiplier of 0.1 of the LM weights) versus initializing from scratch (with the same learning rate everywhere). In both cases, the overall scores are worse than our baseline which starts from the language model, pretrained on MassiveText, and is kept frozen throughout training. This indicates that the strategy of freezing the language model to avoid catastrophic forgetting is beneficial. Even more importantly, freezing the LM is computationally cheaper as no gradient updates of the LM weights are required and we do not need to train on an additional dataset. This computational argument is even more relevant for our largest model, Flamingo-80B, where we freeze almost $90 \%$ of the overall weights.

Additional experiments using the LAION400M dataset. In order to provide reference numbers that are more easily reproducible using publicly available datasets and network weights, we also provide two additional ablations using the CLIP ViT L-14 weights [85] and the LAION400M dataset [96] in row (vii) of Table 10.

B.3.2 Dataset mixing strategies for the contrastive pretraining

One key to achieving strong results was the inclusion of our new dataset LTIP alongside ALIGN for training. Despite being smaller than ALIGN by a factor of 6, a contrastive model trained on only LTIP outperforms one trained only on ALIGN on our evaluation metrics, suggesting that dataset quality may be more important than scale in the regimes in which we operate. We also find that a model trained on both ALIGN and LTIP outperforms those trained on the two datasets individually and that how the datasets are combined is important.

Dataset	Combination strategy	ImageNet accuracy top-1	COCO
			image-to-text			text-to-image
			R@1	R@5	R@10	R@1	R@5	R@10
LTIP	None	40.8	38.6	66.4	76.4	31.1	57.4	68.4
ALIGN	None	35.2	32.2	58.9	70.6	23.7	47.7	59.4
LTIP + ALIGN	Accumulation	45.6	42.3	68.3	78.4	31.5	58.3	69.0
LTIP + ALIGN	Data merged	38.6	36.9	65.8	76.5	15.2	40.8	55.7
LTIP + ALIGN	Round-robin	41.2	40.1	66.7	77.6	29.2	55.1	66.6

Table 11: Effect of contrastive pretraining datasets and combination strategies. The first two rows show the effect of training a small model on LTIP and ALIGN only; the final three show the results of a small model trained on combinations of these datasets, comparing different combination strategies.

To demonstrate this, we train a small model with an NFNet-F0 vision encoder, BERT-mini language encoder and batch size 2048 for 1 million gradient-calculation steps on ALIGN, LTIP and a mixture of the two. The results are presented in Table 11. It shows the results of training models on the combined datasets using three different merging regimes:

Data merged: Batches are constructed by merging examples from each dataset into one batch.
Round-robin: We alternate batches of each dataset, updating the parameters on each batch.
Accumulation: We compute a gradient on a batch from each dataset. These gradients are then weighted and summed and used to update the parameters.
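The three regimes listed above differ only in how per-dataset batches are turned into parameter updates. The sketch below illustrates this on a toy one-parameter least-squares model rather than a contrastive model; the function names, the learning rate and the equal default weights are assumptions made for illustration.

```python
def grad(w, batch):
    """Mean gradient of 0.5 * (w * x - y)**2 with respect to w over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def step_data_merged(w, batch_a, batch_b, lr):
    """One update on a single batch that mixes examples from both datasets."""
    return w - lr * grad(w, batch_a + batch_b)


def step_round_robin(w, batch_a, batch_b, lr):
    """Two consecutive updates, one per dataset."""
    w = w - lr * grad(w, batch_a)
    return w - lr * grad(w, batch_b)


def step_accumulation(w, batch_a, batch_b, lr, weights=(0.5, 0.5)):
    """One update from the weighted sum of the per-dataset gradients."""
    g = weights[0] * grad(w, batch_a) + weights[1] * grad(w, batch_b)
    return w - lr * g


batch_ltip = [(1.0, 2.0)]   # toy stand-in for a batch from one dataset
batch_align = [(2.0, 2.0)]  # toy stand-in for a batch from the other
w_merged = step_data_merged(0.0, batch_ltip, batch_align, lr=0.1)
w_robin = step_round_robin(0.0, batch_ltip, batch_align, lr=0.1)
w_accum = step_accumulation(0.0, batch_ltip, batch_align, lr=0.1)
```

In this toy setting, merging and accumulation with equal weights and equal batch sizes give the same update. With a contrastive loss they would generally differ, since the negatives in a merged batch are drawn from both datasets at once.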

Across all evaluation metrics, we find that the Accumulation method outperforms other methods of combining the datasets. Although the LTIP dataset is $5 \times$ smaller than the ALIGN dataset, this ablation study suggests that the quality of the training data can be more important than its abundance.

C Qualitative results

In addition to the samples in Figure 1, in this section we provide selected samples covering different interaction modalities in Figures 10, 11, and 12. Unlike the quantitative benchmark results which use beam search with a beam width of 3 for decoding, all qualitative results presented in this section use greedy decoding for faster sampling.

Figure 10 shows the simplest form of interaction where a single image is provided followed by a text prompt either in the form of a question or the start of a caption. Even though the model is not trained specifically for the question and answer format, the capabilities of the pretrained language model allow this adaptation. In many of these examples, Flamingo can do at least one step of implicit inference. Some of the objects are not named in the prompt but their properties are queried directly. Based on its visual input, the model manages to recall the knowledge relevant to the referred object and thus produces the correct answer. Vision networks trained contrastively have been shown to learn character recognition capabilities [85]. We observe that Flamingo preserves this capability in the full model, in some cases for text that is rather small with respect to the size of the image.

Since our model can accept inputs in the form of arbitrary sequences of visuals and language, we test its abilities to hold an extended dialogue with interleaved images and text. Figure 11 shows some samples which are generated by prompting the model with a brief dialogue (Appendix B.1.6) followed by user interaction including image insertions. Even after several rounds of interaction Flamingo can still successfully attend to the image and reply to questions that cannot be guessed by language alone. We observe that multiple images can be separately attended: simple comparisons and inferences are handled properly.

Lastly, we investigated similar capabilities with video inputs as they present some extra challenges compared to images. Figure 12 shows some selected samples. As seen in the figure, in some cases Flamingo can successfully integrate information from multiple frames (e.g., videos scanning through a scene or text) and answer questions involving temporal understanding (e.g., in the last example, with the word "after").

D Discussion

D.1 Limitations, failure cases and opportunities

Here, we describe some limitations and failure cases of our models, as well as opportunities for further improving our models and extending their abilities.

Classification performance. Although our visual language models have important advantages over contrastive models (e.g., few-shot learning and open-ended generation capabilities), their performance lags behind that of contrastive models on classification tasks. We believe this is because the contrastive training objective directly optimizes for text-image retrieval, and in practice, the evaluation procedure for classification can be thought of as a special case of image-to-text retrieval [85]. This is not the case for the language modeling objective we use to train our visual language models and this may contribute to the observed performance gap on classification tasks. In particular, Zhao et al. [148] have shown that language models suffer from various biases arising from the training data distribution, the set of samples used in the prompt, and their order. They also show that such issues can be mitigated with calibration techniques, provided one can assume a certain prior distribution (e.g., uniform) over the label space. This assumption doesn't hold in general, and further research is needed to develop techniques to address these issues in the few-shot setting. More generally, seeking objectives, architectures, or evaluation procedures that could bridge the gap between these two classes of models is a promising research direction.
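To make the calibration idea concrete, the following sketch rescales label probabilities by the probabilities the model assigns to the same labels for a content-free input, then renormalises. It is written in the spirit of the calibration discussed above and assumes a uniform prior over labels; the function name `calibrate` and the example numbers are hypothetical, and the exact procedure of Zhao et al. [148] may differ.

```python
def calibrate(label_probs, content_free_probs):
    """Divide each label probability by the probability the model gives that
    label for a content-free input, then renormalise to sum to one.

    Assumes a uniform prior over the label space (an assumption that, as the
    text notes, does not hold in general).
    """
    scaled = [p / q for p, q in zip(label_probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical numbers: the model favours label 0 even without any content.
raw = [0.6, 0.4]
bias = [0.75, 0.25]
calibrated = calibrate(raw, bias)
predicted = calibrated.index(max(calibrated))
```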

Legacies of language models. Our models build on powerful pretrained causal language models, and as a side effect, directly inherit their weaknesses. For instance, causal modeling of the conditioning inputs is strictly less expressive than bidirectional modeling. In this direction, recent work has shown that non-causal masked language modeling adaptation [120] followed by multitask fine-tuning [95, 125, 131] can efficiently improve the zero-shot performance of causal decoder-only language models. Furthermore, transformer-based language models tend to generalize poorly to test sequences significantly longer than the training ones [83]. In settings where the expected text output is too long, the ability of the models to leverage enough shots for few-shot learning can be affected. For instance, for the VisDial dataset [20], a single shot consists of an image followed by a long dialogue composed of 21 different sentences. A sequence of 32 VisDial shots is thus composed of at least $32 \times 21 = 672$ sentences, which in practice means that the prompt length ranges from 4096 to 8192 tokens. This is significantly longer than the maximum sequence length (2048) our LMs have been trained on [42]. For this reason, we have capped our reported results on VisDial at 16 shots. On another note, while our ablations demonstrate the importance of the language model priors inherited from frozen language models, we suspect that they may play a role in occasional hallucinations and ungrounded guesses observed in open-ended dialogue settings. We provide and analyze examples of such behaviours in Figure 13. Finally, language modeling suffers from poor sample efficiency during pretraining [11]. Mitigating this issue has the potential to greatly accelerate progress in the field, by improving turnaround of large-scale training runs and in turn increasing the feasibility of more systematic exploration of design decisions at larger scales. Further discussion on typical weaknesses observed for large LMs can be found in [11, 86].
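The VisDial prompt-length argument is simple arithmetic, reproduced below. The constants come from the text (21 sentences per shot, prompts of 4096 to 8192 tokens at 32 shots, a maximum training length of 2048); the implied tokens-per-sentence range is derived here for illustration and is not a figure reported in the paper.

```python
SENTENCES_PER_SHOT = 21   # one VisDial shot, as stated in the text
MAX_TRAIN_LEN = 2048      # maximum sequence length seen by the LM in training


def num_sentences(shots):
    """Lower bound on the number of sentences in a prompt of `shots` shots."""
    return shots * SENTENCES_PER_SHOT


def tokens_per_sentence(prompt_tokens, shots):
    """Average tokens per sentence implied by a given prompt length."""
    return prompt_tokens / num_sentences(shots)


sentences_32 = num_sentences(32)
low = tokens_per_sentence(4096, 32)    # roughly 6 tokens per sentence
high = tokens_per_sentence(8192, 32)   # roughly 12 tokens per sentence
overshoot = (4096 / MAX_TRAIN_LEN, 8192 / MAX_TRAIN_LEN)
```

A 32-shot prompt is therefore between two and four times longer than anything the LM saw during training.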

Trade-offs of few-shot learning methods. In the paper, we use in-context learning as our "go-to" few-shot learning method (see Section 2.5). This method has notable advantages over gradient-based approaches such as fine-tuning. Indeed, in-context learning requires almost no hyperparameter tuning, works reasonably well in the very low data regime (dozens of examples), and only requires inference, simplifying deployment. In contrast, gradient-based approaches require carefully tuned design choices to avoid overfitting (either by proper learning rate schedule or architecture design [43]) and often need more data (thousands) to work well. This motivated our focus on in-context learning; however, this approach also has drawbacks we discuss next.

Figure 10: Selected single image samples. Gray boxes are user input and the pink boxes are Flamingo output.

Figure 11: Selected dialogue samples. Gray boxes are user input and the pink boxes are Flamingo output. For dialogue, Flamingo is provided with a custom prompt (hidden from the visualization but shown in Appendix B.1.6) containing a dialogue with 3 corresponding images, but it is not fine-tuned for dialogue in any other way.

Figure 12: Selected video samples. These are all of the frames the model sees. (Best viewed with zoom.)

Inference compute cost. The compute cost of in-context learning with transformer models scales linearly with the number of shots if one can reuse the few-shot prompt for multiple query samples (by caching the keys and values) and quadratically otherwise. In contrast, gradient-based few-shot learning approaches [43] have constant complexity with respect to the number of shots during inference.
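The linear versus quadratic scaling of inference cost can be illustrated by counting attention scores. The sketch below is a simplified cost model, assuming causal self-attention, counting only query-key pairs, and treating the one-off cost of encoding a cached prompt as amortised over many queries; the function names are hypothetical.

```python
def attention_pairs(n_new, n_context):
    """Number of query-key scores when n_new tokens are processed causally
    on top of n_context tokens whose keys and values are already cached."""
    # New token i (0-based) attends to the n_context cached tokens plus
    # the i + 1 new tokens up to and including itself.
    return n_new * n_context + n_new * (n_new + 1) // 2


def cost_per_query(prompt_len, query_len, cached):
    """Attention cost of answering one query after a few-shot prompt."""
    if cached:
        # Prompt keys/values are reused; only the query tokens are processed.
        return attention_pairs(query_len, prompt_len)
    # Without caching, prompt and query are re-encoded together each time.
    return attention_pairs(prompt_len + query_len, 0)


# The prompt length grows in proportion to the number of shots.
cached_costs = [cost_per_query(p, 10, cached=True) for p in (100, 200, 300)]
uncached_costs = [cost_per_query(p, 10, cached=False) for p in (100, 200, 300)]
```

With caching, each added block of prompt tokens increases the per-query cost by the same amount, whereas without caching the increments themselves grow with the prompt length.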

Prompt sensitivity. In-context learning has also been shown to be disconcertingly sensitive to various aspects of the demonstrations, such as the order of the samples [148] or their format.

Leveraging more shots. When using in-context learning, performance plateaus rapidly as the number of few-shot samples increases beyond 32. This is in striking contrast with typical gradient-based methods, for which the amount of correctly paired training data is a critical factor for performance. We note that RICES (Retrieval In-Context Example Selection [136], described in Appendix A.2) effectively mitigates this issue for classification tasks (Appendix B.2.1), but still faces similar issues beyond a small number of examples per class.

Task location. Recent work on understanding what makes in-context learning effective sheds some light on a possible explanation for why more shots do not always help [76, 92]. In more detail, Brown et al. [11] raise the question of whether in-context learning actually "learns" new tasks at inference time based on the provided input-output mappings, or simply recognizes and identifies tasks learned during training. On this question, the findings of Reynolds and McDonell [92] suggest that the latter is the key driver of performance across diverse settings, and refer to it as task location. Similarly, Min et al. [76] show that the mapping from input to output generally has limited impact on few-shot performance, as opposed to specifying the overall format of the examples. In line with these findings, we also observe non-trivial zero-shot performance using prompts without any images, which further highlights that the format of the task matters significantly. Intuitively, a handful of samples may often be enough to perform task location well, but the model may generally not be able to leverage further samples at inference time to refine its behaviour.

Figure 13: Hallucinations and ungrounded guesses in open-ended visual question answering. Left: The model occasionally hallucinates by producing answers that seem likely given the text only, but are wrong given the image as additional input. Middle: Similar hallucinations can be provoked by adversarially prompting the model with an irrelevant question. Right: A more common pitfall arises when the model makes ungrounded guesses when the answer cannot be determined based on the inputs. Few-shot examples and more sophisticated prompt design may be used to mitigate these issues. More broadly, addressing these issues is an important research direction towards improving our models' applications in open-ended visual dialogue settings.

In summary, there is no "golden" few-shot method that would work well in all scenarios. In particular, the best choice of few-shot learning approach strongly depends on characteristics of the application, an important one being the number of annotated samples. On this point, in our work, we demonstrate that in-context learning is highly effective in the data-starved regime (32 samples or fewer). There may be opportunities to combine different methods to leverage their complementary benefits, in particular when targeting less data-constrained regimes (e.g., hundreds of samples).

Extending the visual and text interface. Natural language is a powerful and versatile input/output interface to provide descriptions of visual tasks to the model and generate outputs or estimate conditional likelihoods over possible outputs. However, it may be a cumbersome interface for tasks that involve conditioning on or predicting more structured outputs such as bounding boxes (or their temporal and spatio-temporal counterparts); as well as making spatially (or temporally and spatio-temporally) dense predictions. Furthermore, some vision tasks, such as predicting optical flow, involve predicting in continuous space, which is not something our model is designed to handle out of the box. Finally, one may consider additional modalities besides vision that may be complementary, such as audio. All of these directions have the potential to extend the range of tasks that our models can handle; and even improve performance on the ones we focus on, thanks to synergies between the corresponding abilities.

Scaling laws for vision-language models. In this work, we scale Flamingo models up to 80B parameters and provide some initial insights on their scaling behaviour across evaluation benchmarks, summarized in Figure 2. In the language space, an important line of work has focused on establishing scaling laws for language models [42, 53]. In the vision domain, Zhai et al. [145] take a step in this direction. Similar efforts have yet to be made for vision-language models, including contrastive models, as well as visual language models such as the ones we propose. While language modeling scaling law research has focused on perplexity as the golden metric, we speculate that it may be more directly useful for our purposes to establish such trends in terms of aggregate downstream evaluation task performance.

D.2 Benefits, risks and mitigation strategies

D.2.1 Benefits

Accessibility. A system like Flamingo offers a number of potential societal benefits, some of which we will discuss in this section. Broadly, the fact that Flamingo is capable of task generalisation makes it suitable for use cases that have not been the focus of vision research historically. Typical vision systems are trained to solve a particular problem by training on large databases of manually annotated task-specific examples, making them poorly suited for applications outside of the narrow use cases for which they were deliberately trained. On the other hand, Flamingo is trained in a minimally constrained setting, endowing it with strong few-shot task induction capabilities. As we've shown in our qualitative examples (Appendix C), Flamingo can also be used through a "chat"-like interface for open-ended dialogue. Such capabilities could enable non-expert end users to apply models like Flamingo even to low-resource problems for which little to no task-specific training data has been collected, and where queries might be posed in a variety of formats and writing styles. In this direction, we have shown that Flamingo achieves strong performance on the VizWiz challenge$^{1}$, which promotes visual recognition technologies to assist visually impaired people. A dialogue interface could also promote better understanding and interpretability of visual language models. It could help highlight issues with bias, fairness, and toxicity the model may pick up on from the training data. Overall, we believe that Flamingo represents an important step towards making state-of-the-art visual recognition technology more broadly accessible and useful for many diverse applications.

Model recycling. From a modeling perspective, although Flamingo is computationally expensive to train, it importantly leverages pretrained frozen language models and visual encoders. We demonstrated that new modalities can be introduced into frozen models, thereby avoiding expensive retraining. As such models continue to grow in size and computational demands, "recycling" them will become increasingly important from an environmental perspective (as well as a practical one), as described in Larochelle [55] and explored in Strubell et al. [105] for language models. We hope such results may inspire further research into how existing models can be repurposed efficiently rather than trained from scratch.

D.2.2 Risks and mitigation strategies

This section provides some early investigations of the potential risks of models like Flamingo. This study is preliminary and we foresee that further research efforts should be undertaken to better assess those risks. We also discuss potential mitigation strategies towards safely deploying these models. Note that as explained in our Model Card [77] in Appendix E, this model was developed for research purposes only and should not be used in specific applications before proper risk analyses are conducted and mitigation strategies are explored.

By construction, Flamingo inherits the risks of Large LMs. Recall that a large part of our model is obtained by freezing the weights of an existing language model [42]. In particular, if provided with no images, Flamingo falls back to language model behavior. As such, Flamingo is exposed to the same risks as large language models: it can output potentially offensive language, propagate social biases and stereotypes, and leak private information [126]. In particular, we refer to the analysis presented in the Chinchilla paper (Hoffmann et al. [42], Section 4.2.7) in terms of gender bias on the Winogender dataset [93], which demonstrates that even though this model is less biased towards gender than previous models [86], gender biases are still present. In terms of unprompted toxicity, we also refer to the analysis from Chinchilla [42], which highlights that overall the propensity of the model to produce toxic outputs when not prompted to do so is rather low, as measured by computing the PerspectiveAPI toxicity score on 25,000 samples. Weidinger et al. [126] detail possible long-term mitigation strategies for these risks. They include social or public policy interventions, such as the creation of regulatory frameworks and guidelines; careful product design, for instance relating to user interface decisions; and research at the intersection between AI Ethics and NLP, such as building better benchmarks and improving mitigation strategies. In the short term, effective approaches include relying on prompting to mitigate any biases and harmful outputs [86]. Next, we explore the additional risks incurred by Flamingo's additional visual input capabilities.

| | CIDEr difference: female - male $=\Delta$ | CIDEr difference: darker - lighter $=\Delta$ | CIDEr overall |
| :--- | :--- | :--- | :--- |
| AoANet [46] | - | +0.0019 | 1.198 |
| Oscar [61] | - | +0.0030 | 1.278 |
| Flamingo, 0 shot | $0.899-0.870=+0.029$ ($p=0.52$) | $0.955-0.864=+0.091$ ($p=0.25$) | 0.843 |
| Flamingo, 32 shots | $1.172-1.142=+0.030$ ($p=0.54$) | $1.128-1.152=-0.025$ ($p=0.76$) | 1.138 |

Table 12: Bias evaluation of Flamingo for COCO captioning. We report results on the COCO dataset splits over gender and skin tone provided by Zhao et al. [147].

Gender and racial biases when prompted with images. Previous work has studied biases that exist in captioning systems [37, 147]. Such modeling biases can result in real-world harms if deployed without care. For AI systems to be useful to society as a whole, their performance should not depend on the perceived skin tone or gender of the subjects - they should work equally well for all populations. However, current automatic vision system performance has been reported to vary with race, gender or when applied across different demographics and geographic regions [12, 21, 97]. As a preliminary study assessing how Flamingo's performance varies between populations, we follow the study proposed in Zhao et al. [147] and report how the captioning performance of our model varies on COCO as a function of gender and race. Note that we use a different evaluation protocol from the one proposed by Zhao et al. [147]; in that work, they measure results across 5 pretrained models and compute confidence intervals across aggregated per-model scores. Here, we have just one copy of our model (due to its high training cost), and we instead perform statistical tests on the per-sample CIDEr scores across the splits from Zhao et al. [147]. We report the results in Table 12.

Overall, when comparing the CIDEr scores aggregated among images labeled as female versus male, as well as when comparing darker skin versus lighter skin, we find there are no statistically significant differences in the per-sample CIDEr scores. To compare the two sets of samples, we use a two-tailed $t$-test with unequal variance, and among the four comparisons considered, the lowest $p$-value we find is $p=0.25$, well above typical statistical significance thresholds (e.g. a common rejection threshold might be $p<\alpha=0.05$). This implies that the differences in scores are indistinguishable from random variation under the null hypothesis that the mean scores are equal. We note that a failure to reject the null hypothesis and demonstrate a significant difference does not imply that there are no significant differences; it is possible that a difference exists that could be demonstrated with larger sample sizes, for example. However, these preliminary results are nonetheless encouraging.
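The two-tailed $t$-test with unequal variance (Welch's test) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used for Table 12: the function names (`welch_t_test`, `_t_pdf`, `_t_upper_tail`) and the toy score lists are hypothetical, and the $p$-value is obtained by numerically integrating the Student $t$ density instead of calling a statistics library.

```python
import math
from statistics import mean, variance


def _t_pdf(x, df):
    """Density of the Student t distribution with `df` degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def _t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0, using Simpson's rule on the density over [0, x]."""
    if x == 0:
        return 0.5
    h = x / steps  # `steps` must be even for Simpson's rule
    total = _t_pdf(0.0, df) + _t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * _t_pdf(i * h, df)
    return max(0.0, 0.5 - total * h / 3)


def welch_t_test(a, b):
    """Two-tailed Welch t-test. Returns (t statistic, degrees of freedom, p-value)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df, 2.0 * _t_upper_tail(abs(t), df)


# Hypothetical per-sample scores for two splits (NOT data from the paper).
split_a = [1.0, 2.0, 3.0, 4.0, 5.0]
split_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t, df, p = welch_t_test(split_a, split_b)
significant = p < 0.05  # compare against the rejection threshold alpha
```

The null hypothesis of equal means is rejected only when `p` falls below the chosen threshold $\alpha$; as the paragraph above notes, failing to reject it does not show that the means are equal.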

Toxicity when prompted with images. We also evaluate the toxicity of Flamingo's generated captions using the Perspective API${ }^{2}$ when the model is prompted with images from the COCO test set. We observe that some captions are labelled as potentially toxic by the classifier; however, when examining them manually, we do not observe any clear toxicity - output captions are appropriate for the images provided. Overall, based on our own experiences interacting with the system throughout the course of the project, we have not observed toxic outputs when given "safe-for-work" imagery. However, this does not mean the model is incapable of producing toxic outputs, especially if probed with "not-safe-for-work" images and/or toxic text. A more thorough exploration and study would be needed if such a model were put in production.

Applying Flamingo for mitigation strategies. Thanks to its ability to rapidly adapt in low-resource settings, Flamingo could itself be applied in addressing some of the issues described above. For instance, following Thoppilan et al. [111], adequately conditioned or fine-tuned Flamingo models could be used to filter toxic or harmful samples from the training data. In their work, they observe significant improvements relating to safety and quality when fine-tuning on the resulting data. Furthermore, during evaluation, such adapted models could be used to down-rank or exclude outputs that might be classified as offensive, promoting social biases and stereotypes or leaking private information, thus accelerating progress in this direction even for low-resource tasks. Our results on the HatefulMemes benchmark represent a promising step in this direction. Recent work in the language modeling space has also shown success in training an LM to play the role of a "red team" and generate test cases, so as to automatically find cases where another target LM behaves in a harmful way [81]. A similar approach could be derived for our setting. Enabling the model to support outputs with reference to particular locations within the visual inputs, or to external verified quotes is also an interesting direction [72, 111]. Finally, in Figure 11, we provide qualitative examples demonstrating that Flamingo can explain its own outputs, suggesting avenues to explainability and interpretability using the model's text interface.

E Flamingo Model Card

We present a model card for Flamingo in Table 13, following the framework presented by Mitchell et al. [77].

Model Details
Model Date	March 2022
Model Type	Transformer-based autoregressive language model, conditioned on visual features from a convnet-based encoder. Additional transformer-based cross-attention layers incorporate vision features into the language model's text predictions. (See Section 2 for details.)
Intended Uses
Primary Intended Uses	The primary use is research on visual language models (VLM), including: research on VLM applications like classification, captioning or visual question answering, understanding how strong VLMs can contribute to AGI, advancing fairness and safety research in the area of multimodal research, and understanding limitations of current large VLMs.
Out-of-Scope Uses	Uses of the model for visually conditioned language generation in harmful or deceitful settings. Broadly speaking, the model should not be used for downstream applications without further safety and fairness mitigations specific to each application.
Factors
Card Prompts - Relevant Factor	Relevant factors include which language is used. Our model is trained on English data. Our model is designed for research. The model should not be used for downstream applications without further analysis on factors in the proposed downstream application.
Card Prompts - Evaluation Factors	Flamingo is based on Chinchilla (a large proportion of the weights of Chinchilla are used as is) and we refer to the analysis provided in [42, 86] for the language-only component of this work. We refer to our study presented in Appendix D.2.2 for a toxicity analysis when the model is conditioned on an image.
Metrics

Model Performance Measures	We principally focus on the model's ability to predict relevant language when given an image. For that we used a total of 18 different benchmarks described in Appendix B.1.4 spanning various vision and language tasks such as classification (ImageNet, Kinetics700, HatefulMemes), image and video captioning (COCO, VATEX, Flickr30K, YouCook2, RareAct), visual question answering (OKVQA, VizWiz, TextVQA, VQAv2, MSRVTTQA, MSVDQA, iVQA, STAR, NextQA) and visual dialog (VisDial). This was tested either in an open-ended setting where Flamingo generates language and we compare the outputs with the ground truth, or in a close-ended setting where we directly score various outcomes using the likelihood of the model.
Decision thresholds	N/A
Approaches to Uncertainty and Variability	Due to the costs of training Flamingo, we cannot train it multiple times. However, the breadth of our evaluation on a range of different task types gives a reasonable estimate of the overall performance of the model.
Evaluation Data
Datasets	See Table 6 for a detailed list.
Motivation	We chose our evaluation datasets to span an important range of vision and language tasks to correctly assess the ability of Flamingo to produce relevant text given an image.
Preprocessing	Input text is tokenized using a SentencePiece tokenizer with a vocabulary size of 32,000. Images are processed so that their mean and variance are 0 and 1 respectively.
Training Data
See [50], the Datasheet in Appendix F.1, Appendix F.2.1, Appendix F.2.2
Quantitative Analyses
Unitary Results	Flamingo sets a new state of the art in few-shot learning on a wide range of open-ended vision and language tasks. On the 16 tasks we consider, Flamingo also surpasses the fine-tuned state of the art in 6 of the cases despite using orders of magnitude less task-specific training data. We refer to Section 3 for the full details of our quantitative study.
Intersectional Results	We did not investigate intersectional biases.
Ethical Considerations
Data	The data is sourced from a variety of sources, some of it from web content. Sexually explicit content is filtered out, but the dataset does include racist, sexist or otherwise harmful content.
Human Life	The model is not intended to inform decisions about matters central to human life or flourishing.

Mitigations	Apart from removing sexually explicit content we did not filter out toxic content, following the rationale of Rae et al. [86]. More work is needed on mitigation approaches to toxic content and other types of risks associated with language models, such as those discussed in Weidinger et al. [126].
Risks and Harms	The data is collected from the internet, and thus undoubtedly toxic and biased content is included in our training dataset. Furthermore, it is likely that personal information is also in the dataset that has been used to train our models. We defer to the more detailed discussion in Weidinger et al. [126].
Use Cases	Especially fraught use cases include the generation of factually incorrect information with the intent of distributing it or using the model to generate racist, sexist or otherwise toxic text with harmful intent. Many more use cases that could cause harm exist. Such applications to malicious use are discussed in detail in Weidinger et al. [126].

Table 13: Flamingo Model Card. We follow the framework presented in Mitchell et al. [77].
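The Preprocessing entry of the model card states that images are processed so that their mean and variance are 0 and 1. A minimal sketch of that kind of standardization is given below. It is only an illustration: the function name `standardize`, the flat list of pixel values, and the small `eps` term are assumptions, and the model card does not say whether the statistics are computed per image, per channel, or over the whole dataset.

```python
from statistics import fmean, pstdev


def standardize(pixels, eps=1e-8):
    """Shift and scale pixel values to zero mean and (approximately) unit variance.

    `pixels` is a flat list of numeric intensities (an assumption of this sketch);
    `eps` guards against division by zero for constant images.
    """
    mu = fmean(pixels)
    sigma = pstdev(pixels, mu)  # population standard deviation around mu
    return [(p - mu) / (sigma + eps) for p in pixels]


# Hypothetical 8-bit intensities for a tiny image.
normalized = standardize([0, 64, 128, 255])
```

After this step the values are centered on zero with unit spread, which keeps the input scale consistent across images regardless of their original brightness or contrast.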

F Datasheets

F.1 M3W dataset

We follow the framework defined by Gebru et al. [30] and provide the datasheet for M3W in Table 14.

Motivation
For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?	The dataset was created for pre-training vision-language models and was created by researchers and engineers.
Any other comments?	None.
Composition
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?	All instances of the dataset are documents from the web containing interleaved text and images.
How many instances are there in total (of each type, if appropriate)?	There are 43.3 M instances (documents) in total, with a total of 185 M images and 182 GB of text.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?	The dataset is a sample from a larger set.
What data does each instance consist of?	Each instance is made up of a sequence of UTF-8 bytes encoding the document's text, as well as a sequence of integers indicating the positions of images in the text, and the images themselves in compressed format (see Section 2.4).
Is there a label or target associated with each instance?	No, there are no labels associated with each instance.

Is any information missing from individual instances?	No.
Are relationships between individual instances made explicit?	There are no relationships between the different instances in the dataset.
Are there recommended data splits?	We use random splits for the training and development sets.
Are there any errors, sources of noise, or redundancies in the dataset?	There is significant redundancy at the sub-document level.
Is the dataset self-contained, or does it link to or otherwise rely on external resources?	The dataset is self-contained.
Does the dataset contain data that might be considered confidential?	No.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?	The dataset likely contains some data that might be considered offensive, insulting or threatening, as such data is prevalent on the web. We do not try to filter out such content, with the exception of explicit content, which we identify using a dedicated filter.
Collection Process
How was the data associated with each instance acquired?	The data is available publicly on the web.
What mechanisms or procedures were used to collect the data?	The data was collected using a variety of software programs to extract and clean the raw text and images.
If the dataset is a sample from a larger set, what was the sampling strategy?	We randomly subsample documents.
Over what timeframe was the data collected?	The dataset was collected over a period of several months in 2021. We do not filter the sources based on creation date.
Were any ethical review processes conducted?	No.
Preprocessing/cleaning/labeling
Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?	Yes - the pre-processing details are discussed in Appendix A.3.1.
Is the software used to preprocess/clean/label the instances available?	No.
Uses
Has the dataset been used for any tasks already?	Yes, we use the dataset for pre-training multimodal language and vision models.
Is there a repository that links to any or all papers or systems that use the dataset?	No, the dataset has only been used to train the models in this paper.

What (other) tasks could the dataset be used for?	We do not foresee other usages of the dataset at this stage.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?	The dataset is static and thus will become progressively more "stale". For example, it will not reflect new language and norms that evolve over time. However, due to the nature of the dataset it is relatively cheap to collect an up-to-date version.
Are there tasks for which the dataset should not be used?	The dataset described in this paper contains English language text almost exclusively and therefore should not be used for training models intended to have multilingual capabilities.
Distribution
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?	No.

Table 14: M3W Datasheet. We follow the framework as presented by Gebru et al. [30].
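The instance layout described in the datasheet (UTF-8 text bytes, a sequence of integers giving image positions, and the compressed images) can be pictured with a small Python sketch. Everything named here is hypothetical: the class `InterleavedDocument`, its field names, and the `<image>` placeholder string are illustrative choices, and the sketch assumes the integers are byte offsets into the text that fall on character boundaries, which the datasheet does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InterleavedDocument:
    text: bytes                 # UTF-8 encoded document text
    image_positions: List[int]  # assumed: sorted byte offsets into `text`
    images: List[bytes]         # compressed image payloads, one per position

    def render(self, placeholder: str = "<image>") -> str:
        """Return the decoded text with a placeholder at each image position."""
        if len(self.image_positions) != len(self.images):
            raise ValueError("need exactly one position per image")
        pieces, prev = [], 0
        for pos in self.image_positions:
            pieces.append(self.text[prev:pos].decode("utf-8"))
            pieces.append(placeholder)
            prev = pos
        pieces.append(self.text[prev:].decode("utf-8"))
        return "".join(pieces)


# Toy document with two images; the image bytes are stand-ins, not real files.
doc = InterleavedDocument(
    text=b"A cat. A dog.",
    image_positions=[0, 7],
    images=[b"img-0", b"img-1"],
)
rendered = doc.render()
```

Keeping positions separate from the text, as in this sketch, lets the text be tokenized independently while the images are decoded only when needed.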

F.2 Image and video text pair datasets

F.2.1 Datasheet for LTIP

Motivation
For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?	The dataset was created for pre-training vision-language models and was created by researchers and engineers.
Any other comments?	None.
Composition
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?	All instances of the dataset are image-text pairs.
How many instances are there in total (of each type, if appropriate)?	The dataset contains 312M image-text pairs.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?	The dataset is a sample from a larger set.
What data does each instance consist of?	Each instance is made up of a sequence of UTF-8 bytes encoding the document's text, and an image in compressed format (see Appendix A.3.3).
Is there a label or target associated with each instance?	No, there are no labels associated with each instance.
Is any information missing from individual instances?	No.
Are relationships between individual instances made explicit?	There are no relationships between the different instances in the dataset.
Are there recommended data splits?	We use random splits for the training and development sets.
Are there any errors, sources of noise, or redundancies in the dataset?	The data is relatively high quality but there is a chance that some instances are repeated multiple times.
Is the dataset self-contained, or does it link to or otherwise rely on external resources?	The dataset is self-contained.
Does the dataset contain data that might be considered confidential?	No.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?	The websites that were used for this dataset were carefully selected to avoid such content. However given the scale of the data it is possible that some data could be considered offensive or insulting.
Collection Process
How was the data associated with each instance acquired?	The data is available publicly on the web.
What mechanisms or procedures were used to collect the data?	The data was collected using a variety of software programs to extract and clean the raw text and images.

If the dataset is a sample from a larger set, what was the sampling strategy?	N.A.
Over what timeframe was the data collected?	The dataset was collected over a period of several months in 2021. We do not filter the sources based on creation date.
Were any ethical review processes conducted?	No.
Preprocessing/cleaning/labeling
Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?	Some automatic text formatting was applied to remove from the captions dates and locations that were not relevant to the training objective.
Is the software used to preprocess/clean/label the instances available?	No.
Uses
Has the dataset been used for any tasks already?	Yes, we use the dataset for pre-training multimodal language and vision models.
Is there a repository that links to any or all papers or systems that use the dataset?	No, the dataset has only been used to train the models in this paper.
What (other) tasks could the dataset be used for?	We do not foresee other usages of the dataset at this stage.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?	The dataset is static and thus will become progressively more "stale". For example, it will not reflect new language and norms that evolve over time. However, due to the nature of the dataset it is relatively cheap to collect an up-to-date version.
Are there tasks for which the dataset should not be used?	The dataset described in this paper contains English language text almost exclusively and therefore should not be used for training models intended to have multilingual capabilities.
Distribution
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?	No.

Table 15: LTIP Datasheet. We follow the framework as presented by Gebru et al. [30].

F.2.2 Datasheet for VTP

Motivation
For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?	The dataset was created for pre-training vision-language models and was created by researchers and engineers.
Any other comments?	None.
Composition
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?	All instances of the dataset are video-text pairs.
How many instances are there in total (of each type, if appropriate)?	The dataset contains 27M video-text pairs.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?	The dataset is a sample from a larger set.
What data does each instance consist of?	Each instance is made up of a sequence of UTF-8 bytes encoding the document's text, and a video in compressed format (see Appendix A.3.3).
Is there a label or target associated with each instance?	No, there are no labels associated with each instance.
Is any information missing from individual instances?	No.
Are relationships between individual instances made explicit?	There are no relationships between the different instances in the dataset.
Are there recommended data splits?	We use random splits for the training and development sets.
Are there any errors, sources of noise, or redundancies in the dataset?	The data is relatively high quality but there is a chance that some instances are repeated multiple times.
Is the dataset self-contained, or does it link to or otherwise rely on external resources?	The dataset is self-contained.
Does the dataset contain data that might be considered confidential?	No.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?	The websites that were used for this dataset were carefully selected to avoid such content. However given the scale of the data it is possible that some data could be considered offensive or insulting.
Collection Process
How was the data associated with each instance acquired?	The data is available publicly on the web.
What mechanisms or procedures were used to collect the data?	The data was collected using a variety of software programs to extract and clean the raw text and videos.

If the dataset is a sample from a larger set, what was the sampling strategy?	N.A.
Over what timeframe was the data collected?	The dataset was collected over a period of several months in 2021. We do not filter the sources based on creation date.
Were any ethical review processes conducted?	No.
Preprocessing/cleaning/labeling
Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?	Some automatic text formatting was applied to remove from the captions dates and locations that were not relevant to the training objective.
Is the software used to preprocess/clean/label the instances available?	No.
Uses
Has the dataset been used for any tasks already?	Yes, we use the dataset for pre-training multimodal language and vision models.
Is there a repository that links to any or all papers or systems that use the dataset?	No, the dataset has only been used to train the models in this paper.
What (other) tasks could the dataset be used for?	We do not foresee other usages of the dataset at this stage.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?	The dataset is static and thus will become progressively more "stale". For example, it will not reflect new language and norms that evolve over time. However, due to the nature of the dataset it is relatively cheap to collect an up-to-date version.
Are there tasks for which the dataset should not be used?	The dataset described in this paper contains English language text almost exclusively and therefore should not be used for training models intended to have multilingual capabilities.
Distribution
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?	No.

Table 16: VTP Datasheet. We follow the framework as presented by Gebru et al. [30].

G Credit for visual content

Figure 1:
Row 1: All images are provided under license by Unsplash.
Row 2: All images are under the public domain.
Row 3: First two images are provided under license by Unsplash.
Row 5: Available from DALL·E 2 [89].
Row 6: First two are provided under license by Unsplash, the third one is provided by Wikimedia Commons, licensed under CC BY-ND 2.0.
Row 7: The images are provided by Wikimedia Commons, licensed under CC BY-ND 2.0.
Row 8: The images are provided by Wikimedia Commons, licensed under CC BY-ND 2.0.
Row 9: This video is from YFCC100M, licensed under CC BY-ND 2.0.
Dialogue 1: Available from DALL·E 2 [89].
Dialogue 2: The first icon is provided under license by Flaticon, the second image is provided under license by Unsplash, the third one is provided under license by Sketchfab.
Dialogue 3: Available from CLIP [85].
Dialogue 4: Chicago and Tokyo pictures obtained from Unsplash.
Model Figures 3, 7, 8 and 9: All images are provided under license by Unsplash.
Qualitative Figures 10, 11, 12, and 13: All visuals come from various sources, including the COCO dataset and Wikimedia Commons (licensed under CC BY-ND 2.0), or are available from DALL·E 2 [89].

${ }^{1}$ https://vizwiz.org/
${ }^{2}$ https://perspectiveapi.com/
