LG – Machine Learning  CV – Computer Vision  CL – Computation and Language  AS – Audio and Speech


1、[LG] The Edge of Orthogonality: A Simple View of What Makes BYOL Tick
2、[CV] Bridging the Sim2Real gap with CARE: Supervised Detection Adaptation with Conditional Alignment and Reweighting
3、[CL] Toolformer: Language Models Can Teach Themselves to Use Tools
4、[CV] Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning
5、[LG] Riemannian Flow Matching on General Geometries
[LG] Spatial Functa: Scaling Functa to ImageNet Classification and Generation
[CV] RelightableHands: Efficient Neural Relighting of Articulated Hand Models
[CV] Q-Diffusion: Quantizing Diffusion Models
[AS] ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models

Summary: what makes BYOL tick; supervised detection adaptation with conditional alignment and reweighting; a language model that teaches itself to use API tools; a retrieval-augmented visual language model for zero- and few-shot image captioning; Riemannian flow matching on general geometries; scaling functa to ImageNet classification and generation; efficient neural relighting of articulated hand models; quantizing diffusion models; text-to-waveform music generation with diffusion models.

1、[LG] The Edge of Orthogonality: A Simple View of What Makes BYOL Tick

P H. Richemond, A Tam, Y Tang, F Strub, B Piot, F Hill
[DeepMind]

Key points:

  1. Investigates the role of the linear predictor in BYOL and shows that it acts approximately as an orthogonal projection and as an operator that expands the covariance of the latents;
  2. Interprets the BYOL update as a Riemannian gradient-descent step in which the otherwise expensive retraction step becomes essentially free thanks to the predictor and the exponential moving average (EMA);
  3. Proposes new self-supervised learning variants with four closed-form predictors that outperform BYOL with its standard trainable linear predictor while using only matrix multiplications.

One-sentence summary:
Offers a simple explanation of the mechanism underlying self-predictive unsupervised learning methods, BYOL in particular, by showing the crucial roles played by the predictor network, the exponential moving average, and the stop-gradient operator.

Abstract:

Self-predictive unsupervised learning methods such as BYOL or SimSiam have shown impressive results, and counter-intuitively, do not collapse to trivial representations. In this work, we aim at exploring the simplest possible mathematical arguments towards explaining the underlying mechanisms behind self-predictive unsupervised learning. We start with the observation that those methods crucially rely on the presence of a predictor network (and stop-gradient). With simple linear algebra, we show that when using a linear predictor, the optimal predictor is close to an orthogonal projection, and propose a general framework based on orthonormalization that enables to interpret and give intuition on why BYOL works. In addition, this framework demonstrates the crucial role of the exponential moving average and stop-gradient operator in BYOL as an efficient orthonormalization mechanism. We use these insights to propose four new closed-form predictor variants of BYOL to support our analysis. Our closed-form predictors outperform standard linear trainable predictor BYOL at 100 and 300 epochs (top-1 linear accuracy on ImageNet).

https://arxiv.org/abs/2302.04817
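
To make the closed-form-predictor idea above concrete, here is a minimal, hypothetical sketch of a BYOL-style loss in which the linear predictor is set in closed form from batch statistics rather than trained; the covariance-based construction and rescaling below are illustrative assumptions, not necessarily one of the four variants proposed in the paper.

```python
# Hedged sketch only: a BYOL-style loss with a closed-form linear predictor
# built from batch statistics instead of a trained predictor network.
import torch
import torch.nn.functional as F

def closed_form_predictor(z_online: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Build a linear predictor (D, D) from a batch of online embeddings (B, D)."""
    z = z_online - z_online.mean(dim=0, keepdim=True)
    cov = z.T @ z / z.shape[0]            # empirical covariance of the online latents
    return cov / (cov.norm() + eps)       # crude rescaling so the map is well-behaved

def byol_loss(z_online: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """Regress predicted online embeddings onto stop-gradient target embeddings."""
    W = closed_form_predictor(z_online).detach()   # predictor uses only matrix products
    p = F.normalize(z_online @ W, dim=-1)
    t = F.normalize(z_target.detach(), dim=-1)     # stop-gradient on the target branch
    return (2 - 2 * (p * t).sum(dim=-1)).mean()
```

In BYOL itself the target branch is an EMA copy of the online encoder; the analysis above interprets that EMA plus the stop-gradient as an efficient orthonormalization mechanism.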

2、[CV] Bridging the Sim2Real gap with CARE: Supervised Detection Adaptation with Conditional Alignment and Reweighting

V Prabhu, D Acuna, A Liao, R Mahmood, M T. Law, J Hoffman, S Fidler, J Lucas
[Georgia Tech & NVIDIA]

Key points:

  1. A detailed study of supervised Sim2Real object-detection adaptation showing that existing methods perform suboptimally because they do not exploit target labels;
  2. Proposes Domain Translation via Conditional Alignment and Reweighting (CARE), which explicitly closes the sim2real appearance and content gaps and outperforms competing detection-adaptation methods on standard Sim2Real benchmarks;
  3. Formalizes the setting with a joint risk-minimization framework and provides theoretical insight into the design choices.

One-sentence summary:
Studies adapting a 2D object-detection model from a labeled synthetic source domain to a real target domain, a common problem in high-stakes applications such as autonomous driving.

Abstract:

Sim2Real domain adaptation (DA) research focuses on the constrained setting of adapting from a labeled synthetic source domain to an unlabeled or sparsely labeled real target domain. However, for high-stakes applications (e.g. autonomous driving), it is common to have a modest amount of human-labeled real data in addition to plentiful auto-labeled source data (e.g. from a driving simulator). We study this setting of supervised sim2real DA applied to 2D object detection. We propose Domain Translation via Conditional Alignment and Reweighting (CARE), a novel algorithm that systematically exploits target labels to explicitly close the sim2real appearance and content gaps. We present an analytical justification of our algorithm and demonstrate strong gains over competing methods on standard benchmarks.

https://arxiv.org/abs/2302.04832
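
As an illustration of the conditional-alignment-and-reweighting idea (not the paper's exact CARE objective), the sketch below matches per-class mean features of simulated and real detections and reweights each class by its frequency in the real domain; the function name and loss form are assumptions.

```python
# Illustrative class-conditional alignment with per-class reweighting.
import torch

def conditional_alignment_loss(feat_sim, lbl_sim, feat_real, lbl_real, num_classes):
    """Match per-class mean features across domains, reweighted by real-class frequency."""
    losses = []
    for c in range(num_classes):
        fs, fr = feat_sim[lbl_sim == c], feat_real[lbl_real == c]
        if len(fs) == 0 or len(fr) == 0:
            continue                                  # class absent in one domain
        weight = (lbl_real == c).float().mean()       # reweight by target-label frequency
        losses.append(weight * (fs.mean(0) - fr.mean(0)).pow(2).sum())
    return torch.stack(losses).sum() if losses else feat_sim.new_zeros(())
```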

3、[CL] Toolformer: Language Models Can Teach Themselves to Use Tools

T Schick, J Dwivedi-Yu, R Dessì, R Raileanu, M Lomeli, L Zettlemoyer, N Cancedda, T Scialom
[Meta AI Research]

Key points:

  1. Introduces Toolformer, a language model that learns to use external tools through simple APIs in a self-supervised way;
  2. Toolformer improves zero-shot performance across a variety of downstream tasks, is competitive with much larger models, and does not sacrifice its core language-modeling abilities;
  3. Shows that Toolformer can learn to use a range of tools, including a calculator, a question-answering system, two different search engines, a translation system, and a calendar.

One-sentence summary:
Toolformer is a self-supervised language model that learns to use different tools via API calls, improving its performance on downstream tasks without sacrificing its language-modeling abilities.

Abstract:

Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, two different search engines, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.

https://arxiv.org/abs/2302.04761
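
The self-supervised selection step described above can be sketched as follows: a sampled API call is kept only if conditioning on the call and its result lowers the language model's loss on the following tokens by at least a margin. The `lm_loss_fn` interface and the bracketed call format are simplifying assumptions, not the paper's exact implementation.

```python
# Sketch of a Toolformer-style filtering criterion for candidate API calls.
def keep_api_call(lm_loss_fn, prefix: str, call: str, result: str,
                  continuation: str, tau: float = 0.1) -> bool:
    """lm_loss_fn(context, continuation) -> average negative log-likelihood."""
    loss_with_result = lm_loss_fn(f"{prefix} [{call} -> {result}]", continuation)
    loss_call_only   = lm_loss_fn(f"{prefix} [{call}]", continuation)   # call, no result
    loss_plain       = lm_loss_fn(prefix, continuation)
    # Keep the call only if the result helps by at least tau over the best alternative.
    return min(loss_plain, loss_call_only) - loss_with_result >= tau
```

Calls that survive this kind of filter are spliced back into the text, and the model is then finetuned on the augmented corpus.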

4、[CV] Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning

Z Yang, W Ping, Z Liu, V Korthikanti, W Nie, D Huang, L Fan, Z Yu…
[NVIDIA & UIUC]

Key points:

  1. Re-ViLM is a retrieval-augmented visual language model for image-to-text generation built on the Flamingo architecture;
  2. By retrieving relevant knowledge from an external database, Re-ViLM reduces the number of model parameters and can incorporate new data efficiently;
  3. Re-ViLM is initialized from RETRO, a pretrained retrieval-augmented language model, so retrieval capability is integrated from the very start of multimodal pretraining;
  4. Re-ViLM applies a simple filtering strategy during retrieval to avoid "copy-paste" behavior and is pretrained on a new interleaved image-text dataset that facilitates in-context few-shot learning.

One-sentence summary:
Proposes Re-ViLM, a retrieval-augmented visual language model that extends the state-of-the-art visual language model Flamingo with a multimodal retriever and retrieval-augmented language-model layers, retrieving relevant knowledge from an external database for zero- and few-shot image captioning.

Abstract:

Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has obtained state-of-the-art results in image-to-text generation. However, these models store all the knowledge within their parameters, thus often requiring enormous model parameters to model the abundant visual concepts and very rich textual descriptions. Additionally, they are inefficient in incorporating new data, requiring a computationally expensive fine-tuning process. In this work, we introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon Flamingo, that supports retrieving the relevant knowledge from the external database for zero and in-context few-shot image-to-text generations. By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters and can easily accommodate new data during evaluation by simply updating the database. We also construct an interleaved image and text dataset that facilitates in-context few-shot learning capabilities. We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks, especially for zero-shot and few-shot generation in out-of-domain settings with 4 times fewer parameters compared with baseline methods.

https://arxiv.org/abs/2302.04858
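
A minimal sketch of the retrieval step assumed by such a retrieval-augmented captioner: given an image feature, fetch the captions of the most similar database images by cosine similarity, with a crude near-duplicate filter in the spirit of the "copy-paste" filtering mentioned above. The interface and threshold are illustrative assumptions, not the Re-ViLM API.

```python
# Illustrative nearest-neighbor caption retrieval from an external database.
import torch
import torch.nn.functional as F

def retrieve_captions(query_feat, db_feats, db_captions, k=4, dedup_thresh=0.999):
    """query_feat: (D,); db_feats: (N, D); db_captions: list of N strings."""
    q = F.normalize(query_feat, dim=-1)
    db = F.normalize(db_feats, dim=-1)
    sims = db @ q                                    # (N,) cosine similarities
    sims = torch.where(sims > dedup_thresh, torch.full_like(sims, -1.0), sims)
    topk = sims.topk(k).indices                      # k most similar database entries
    return [db_captions[i] for i in topk.tolist()]
```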

5、[LG] Riemannian Flow Matching on General Geometries

R T. Q. Chen, Y Lipman
[Meta AI]

Key points:

  1. Introduces Riemannian Flow Matching (RFM), a framework for training continuous normalizing flows on manifolds;
  2. Compares against existing approaches to generative modeling on manifolds and highlights the advantages of RFM;
  3. Explains the key ingredient behind RFM: the construction of a simple kernel function that defines per-sample vector fields;
  4. Reports state-of-the-art performance on real-world non-Euclidean datasets and demonstrates tractable training on general geometries, including closed manifolds and manifolds with boundaries.

One-sentence summary:
Proposes Riemannian Flow Matching (RFM), a scalable and efficient method for training continuous normalizing flows on manifolds with general geometries.

Abstract:

We propose Riemannian Flow Matching (RFM), a simple yet powerful framework for training continuous normalizing flows on manifolds. Existing methods for generative modeling on manifolds either require expensive simulation, inherently cannot scale to high dimensions, or use approximations to limiting quantities that result in biased objectives. Riemannian Flow Matching bypasses these inconveniences and exhibits multiple benefits over prior approaches: It is completely simulation-free on simple geometries, it does not require divergence computation, and its target vector field is computed in closed form even on general geometries. The key ingredient behind RFM is the construction of a simple kernel function for defining per-sample vector fields, which subsumes existing Euclidean cases. Extending to general geometries, we rely on the use of spectral decompositions to efficiently compute kernel functions. Our method achieves state-of-the-art performance on real-world non-Euclidean datasets, and we showcase, for the first time, tractable training on general geometries, including on triangular meshes and maze-like manifolds with boundaries.

https://arxiv.org/abs/2302.03660
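
For intuition, the sketch below shows the simulation-free conditional flow matching objective in the flat Euclidean case, which the Riemannian construction subsumes; the manifold-specific parts (geodesic or spectral-distance premetrics) are not reproduced here.

```python
# Euclidean special case: simulation-free conditional flow matching loss.
import torch

def flow_matching_loss(vector_field, x1: torch.Tensor) -> torch.Tensor:
    """vector_field(x_t, t) -> predicted velocity; x1: batch of data points (B, D)."""
    x0 = torch.randn_like(x1)              # samples from the base (noise) distribution
    t = torch.rand(x1.shape[0], 1)         # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1            # point on the straight-line probability path
    u_t = x1 - x0                          # conditional target velocity for that path
    return ((vector_field(x_t, t) - u_t) ** 2).mean()
```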


A few other papers worth noting:

[LG] Spatial Functa: Scaling Functa to ImageNet Classification and Generation

M Bauer, E Dupont, A Brock, D Rosenbaum, J R Schwarz, H Kim
[DeepMind & University of Haifa]

Key points:

  1. Proposes spatial functa as a way to scale the functa framework to larger and more complex datasets;
  2. Observes limitations of functa on classification and generation tasks even on small datasets such as CIFAR-10;
  3. Overcomes these limitations with spatially arranged functa representations (spatial functa), achieving competitive performance on ImageNet-1k at 256×256 resolution.

One-sentence summary:
Proposes spatial functa, a new approach that extends the existing functa framework to larger and more complex datasets, in particular ImageNet-1k at 256×256 resolution, for classification and generation.

Neural fields, also known as implicit neural representations, have emerged as a powerful means to represent complex signals of various modalities. Based on this, Dupont et al. (2022) introduce a framework that views neural fields as data, termed functa, and propose to do deep learning directly on this dataset of neural fields. In this work, we show that the proposed framework faces limitations when scaling up to even moderately complex datasets such as CIFAR-10. We then propose spatial functa, which overcome these limitations by using spatially arranged latent representations of neural fields, thereby allowing us to scale up the approach to ImageNet-1k at 256×256 resolution. We demonstrate competitive performance to Vision Transformers (Steiner et al., 2022) on classification and Latent Diffusion (Rombach et al., 2022) on image generation respectively.

https://arxiv.org/abs/2302.03130
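
A hypothetical sketch of the "spatially arranged latents" idea: instead of one global latent per image, a small latent grid is bilinearly sampled at each query coordinate and decoded to RGB. The grid size and decoder are illustrative assumptions; the paper's actual architecture and fitting procedure are not reproduced here.

```python
# Toy neural field conditioned on a spatial grid of latents.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialLatentField(nn.Module):
    def __init__(self, grid_size=16, latent_dim=64, hidden=128):
        super().__init__()
        self.latents = nn.Parameter(torch.zeros(1, latent_dim, grid_size, grid_size))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, coords):                        # coords: (B, 2) in [-1, 1]
        grid = coords.view(1, -1, 1, 2)
        z = F.grid_sample(self.latents, grid, align_corners=True)   # (1, C, B, 1)
        z = z.squeeze(-1).squeeze(0).T                               # (B, C)
        return self.decoder(torch.cat([z, coords], dim=-1))
```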

[CV] RelightableHands: Efficient Neural Relighting of Articulated Hand Models

S Iwase, S Saito, T Simon, S Lombardi, T Bagautdinov, R Joshi, F Prada, T Shiratori, Y Sheikh, J Saragih
[Meta AI & CMU]

Key points:

  1. Presents the first neural relighting approach for personalized hand models that can be animated in real time under novel illumination;
  2. Adopts a teacher-student framework: the teacher synthesizes high-fidelity hands under arbitrary illumination at heavy computational cost, while the student efficiently predicts appearance under natural illumination in real time;
  3. Uses physics-inspired illumination features such as visibility and diffuse shading, which correlate strongly with light-transport effects, as conditioning data for the neural relighting network.

One-sentence summary:
Presents the first neural relighting approach for high-fidelity personalized hand models that can be animated in real time under novel illumination.

We present the first neural relighting approach for rendering high-fidelity personalized hands that can be animated in real-time under novel illumination. Our approach adopts a teacher-student framework, where the teacher learns appearance under a single point light from images captured in a light-stage, allowing us to synthesize hands in arbitrary illuminations but with heavy compute. Using images rendered by the teacher model as training data, an efficient student model directly predicts appearance under natural illuminations in real-time. To achieve generalization, we condition the student model with physics-inspired illumination features such as visibility, diffuse shading, and specular reflections computed on a coarse proxy geometry, maintaining a small computational overhead. Our key insight is that these features have strong correlation with subsequent global light transport effects, which proves sufficient as conditioning data for the neural relighting network. Moreover, in contrast to bottleneck illumination conditioning, these features are spatially aligned based on underlying geometry, leading to better generalization to unseen illuminations and poses. In our experiments, we demonstrate the efficacy of our illumination feature representations, outperforming baseline approaches. We also show that our approach can photorealistically relight two interacting hands at real-time speeds.

https://arxiv.org/abs/2302.04866
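
The physics-inspired conditioning features mentioned above can be illustrated with textbook shading terms computed per point on a coarse proxy surface; the Lambertian and Blinn-Phong terms below are standard approximations standing in for the paper's exact feature set (which also includes visibility).

```python
# Illustrative per-point shading features on a coarse proxy geometry.
import torch
import torch.nn.functional as F

def shading_features(normals, light_dir, view_dir, shininess=32.0):
    """normals, light_dir, view_dir: (N, 3) per-point vectors on the proxy surface."""
    n = F.normalize(normals, dim=-1)
    l = F.normalize(light_dir, dim=-1)
    v = F.normalize(view_dir, dim=-1)
    diffuse = (n * l).sum(-1).clamp(min=0.0)          # Lambertian n·l diffuse shading
    h = F.normalize(l + v, dim=-1)                    # Blinn-Phong half vector
    specular = (n * h).sum(-1).clamp(min=0.0) ** shininess
    return torch.stack([diffuse, specular], dim=-1)   # per-point conditioning features
```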

[CV] Q-Diffusion: Quantizing Diffusion Models

X Li, L Lian, Y Liu, H Yang, Z Dong, D Kang, S Zhang, K Keutzer
[UC Berkeley & Nanjing University & Peking University]

Key points:

  1. Proposes a new way to accelerate diffusion models by compressing the noise-estimation network with post-training quantization (PTQ);
  2. The proposed Q-Diffusion method calibrates on data from multiple timesteps of the diffusion denoising computation, which improves the performance of the quantized model;
  3. The method quantizes diffusion-model weights to 4 or 8 bits with an FID change of at most 1.88 while maintaining comparable performance.

One-sentence summary:
Proposes a new PTQ method for diffusion models that balances calibration quality against dataset size, yielding qualitatively comparable results with only a small FID change for both pixel-space and latent-space diffusion models.

Diffusion models have recently achieved great success in synthesizing diverse and high-fidelity images. However, sampling speed and memory constraints remain a major barrier to the practical adoption of diffusion models as the generation process for these models can be slow due to the need for iterative noise estimation using complex neural networks. We propose a solution to this problem by compressing the noise estimation network to accelerate the generation process using post-training quantization (PTQ). While existing PTQ approaches have not been able to effectively deal with the changing output distributions of noise estimation networks in diffusion models over multiple time steps, we are able to formulate a PTQ method that is specifically designed to handle the unique multi-timestep structure of diffusion models with a data calibration scheme using data sampled from different time steps. Experimental results show that our proposed method is able to directly quantize full-precision diffusion models into 8-bit or 4-bit models while maintaining comparable performance in a training-free manner, achieving a FID change of at most 1.88. Our approach can also be applied to text-guided image generation, and for the first time we can run stable diffusion in 4-bit weights without losing much perceptual quality, as shown in Figure 5 and Figure 9.

https://arxiv.org/abs/2302.04304
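
A simplified sketch of the two ingredients summarized above: symmetric uniform post-training quantization of weights, plus a calibration set drawn from multiple diffusion timesteps. The helper names and the sampler interface are assumptions; the paper's actual PTQ pipeline (per-layer reconstruction, activation handling) is considerably more involved.

```python
# Minimal PTQ sketch with multi-timestep calibration data.
import torch

def quantize_weights(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform quantization of a weight tensor to n_bits."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def collect_calibration_data(sample_xt, timesteps, n_per_step=8):
    """sample_xt(t, n) is assumed to return n noisy inputs x_t at timestep t."""
    return [sample_xt(t, n_per_step) for t in timesteps]   # spread t over the schedule
```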

[AS] ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models

P Zhu, C Pang, S Wang, Y Chai, Y Sun, H Tian, H Wu
[Baidu Inc]

Key points:

  1. Proposes the first text-to-waveform music generation model, which takes free-form text as the condition and generates music with a diffusion model;
  2. Addresses the lack of free-form text-music parallel data by collecting weakly supervised data from the Internet;
  3. Studies and compares the effect of two text formats on the generative model, finding that free-form text yields better text-music relevance.

One-sentence summary:
Proposes ERNIE-Music, the first music-generation model to generate waveform music conditioned on free-form text; trained on music collected from the Internet paired with its comment text, it outperforms related work in diversity, quality, and text-music relevance.

In recent years, there has been an increased popularity in image and speech generation using diffusion models. However, directly generating music waveforms from free-form text prompts is still under-explored. In this paper, we propose the first text-to-waveform music generation model that can receive arbitrary texts using diffusion models. We incorporate the free-form textual prompt as the condition to guide the waveform generation process of diffusion models. To solve the problem of lacking such text-music parallel data, we collect a dataset of text-music pairs from the Internet with weak supervision. Besides, we compare the effect of two prompt formats of conditioning texts (music tags and free-form texts) and prove the superior performance of our method in terms of text-music relevance. We further demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.

https://arxiv.org/abs/2302.04456
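
As background for the diffusion-based generation described above, here is a generic sketch of one text-conditioned denoising-diffusion training step on raw waveforms; the denoiser signature, noise schedule, and conditioning mechanism are assumptions and do not reproduce ERNIE-Music's design.

```python
# Generic text-conditioned diffusion training step on waveforms.
import torch

def diffusion_step_loss(denoiser, waveform, text_emb, alphas_cumprod):
    """denoiser(noisy_wave, t, text_emb) -> predicted noise; waveform: (B, T)."""
    b = waveform.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a_bar = alphas_cumprod[t].view(b, 1)               # cumulative schedule terms
    noise = torch.randn_like(waveform)
    noisy = a_bar.sqrt() * waveform + (1 - a_bar).sqrt() * noise
    return ((denoiser(noisy, t, text_emb) - noise) ** 2).mean()
```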

End of article.