
LoRA

Last modified: November 17, 2023

Paper

•
LoRA: Low-Rank Adaptation of Large Language Models

•

https://arxiv.org/abs/2106.09685

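Abstract

•
Inject trainable rank decomposition matrices into each layer of the Transformer, with the pre-trained weights kept frozen

[SVD and low-rank matrix approximation for data compression - Zhihu (zhihu.com)](https://zhuanlan.zhihu.com/p/447385674)

•
Higher training throughput and no additional inference latency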
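Why there is no extra inference latency can be checked with a small sketch. It assumes the paper's notation, which these notes do not spell out: a frozen weight W0 of shape d x k and an update factored as B (d x r) times A (r x k). The sizes, the random values, and the helper functions are hypothetical, and the paper's scaling factor is left out for brevity. In the paper B starts at zero; here it is random to stand in for a trained adapter.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, k, r = 6, 4, 2            # hypothetical sizes; r is the LoRA rank

W0 = rand_matrix(d, k)       # frozen pre-trained weight
B = rand_matrix(d, r)        # trainable factor (stand-in for a trained adapter)
A = rand_matrix(r, k)        # trainable factor
x = rand_matrix(k, 1)        # one input column vector

# During training the adapter runs as a separate branch: h = W0 x + B (A x)
h_branch = add(matmul(W0, x), matmul(B, matmul(A, x)))

# For deployment the product B A is folded into the weight once,
# so inference is a single matmul with the same shape as before.
W_merged = add(W0, matmul(B, A))
h_merged = matmul(W_merged, x)

max_diff = max(abs(p[0] - q[0]) for p, q in zip(h_branch, h_merged))
print("largest difference between the two outputs:", max_diff)
```

The two outputs agree up to floating-point rounding, so the merged model costs the same at inference as the original one. Switching tasks amounts to subtracting one B A product and adding another.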

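Conclusion

•
LoRA introduces no inference latency and does not reduce the usable input sequence length, and it allows quick task-switching

•
Future work
◦
Combine LoRA with other adaptation methods, which may give orthogonal improvements
◦
The mechanism behind fine-tuning and LoRA is still unclear: how do the features learned in pre-training become useful on downstream tasks? LoRA may make this question more tractable to answer than full fine-tuning
◦
The weight matrices to apply LoRA to are chosen mostly by heuristics; is there a more principled way to choose them?
◦
"The rank-deficiency of ∆W suggests that W could be rank-deficient as well." This point stands out: W itself may also be low-rank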

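Introduction

•
Problems with existing approaches
◦
They introduce inference latency
◦
They reduce the model's usable sequence length
◦
They often fail to match the fine-tuning baseline

•
Key observation: "the learned over-parametrized models in fact reside on a low intrinsic dimension"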

The statement "the learned over-parametrized models in fact reside on a low intrinsic dimension" means that even though deep neural networks have a very large number of parameters, the solutions they learn can be described with far fewer degrees of freedom. That smaller number is the "intrinsic dimension" of the model. In other words, the model has a lower-dimensional structure that captures most of the information needed for its task. LoRA exploits this structure to reduce the number of trainable parameters and make adapting pre-trained models more efficient.

•
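Hypothesis: the change in weights during adaptation also has a low intrinsic rank, so LoRA optimizes rank decomposition matrices of the dense layers' weight change instead of the dense weights themselves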
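To get a feel for how much a low-rank update saves, here is a rough parameter count. It assumes the paper's factorization of the update into B (d x r) and A (r x k); the layer size and rank below are hypothetical values picked only for illustration.

```python
def full_update_params(d: int, k: int) -> int:
    """Trainable entries when the whole d x k update matrix is learned."""
    return d * k

def low_rank_update_params(d: int, k: int, r: int) -> int:
    """Trainable entries when the update is factored as B (d x r) times A (r x k)."""
    return r * (d + k)

# Hypothetical layer size and rank, not taken from the notes.
d, k, r = 4096, 4096, 8

full = full_update_params(d, k)
low_rank = low_rank_update_params(d, k, r)
reduction = full / low_rank

print(f"full update:   {full:,} trainable parameters")
print(f"rank-{r} update: {low_rank:,} trainable parameters")
print(f"reduction:     {reduction:.0f}x")
```

The saving grows as r shrinks relative to d and k, which is why the low-rank hypothesis matters: if the true update needed a large r, the factorization would buy little.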

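Problem statement

$P_\Phi(y \mid x)$ is a pre-trained autoregressive language model parametrized by $\Phi$. In NL2SQL, $x$ is the natural-language instruction and $y$ is the SQL command; in summarization, $x$ is the article content and $y$ is the summary.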

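Full fine-tuning optimizes the following objective over the training set $\mathcal{Z} = \{(x_i, y_i)\}$ of context-target pairs:

$$\max_{\Phi} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log\left(P_{\Phi}(y_t \mid x, y_{<t})\right)$$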

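Each fine-tuning run starts from the pre-trained weights $\Phi_0$ and updates them to $\Phi_0 + \Delta\Phi$.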

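The problem with full fine-tuning is that every downstream task requires learning its own set of parameters $\Delta\Phi$, whose dimension $|\Delta\Phi|$ equals $|\Phi_0|$, so storing and training many such models is challenging, if at all feasible.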

Language note: the English phrase "if at all" expresses doubt, here meaning that it may not be feasible in the first place.

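LoRA trains a much smaller set of parameters $\Theta$ that encodes the task-specific increment, so that $\Delta\Phi = \Delta\Phi(\Theta)$ with $|\Theta| \ll |\Phi_0|$.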
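For reference, the objective that results from this reparametrization, as written in the paper, is sketched below. Here $\mathcal{Z}$ is the training set of context-target pairs $(x, y)$, $\Phi_0$ stays frozen, and only $\Theta$ is optimized; the symbols follow the paper rather than anything recoverable from these notes.

```latex
\max_{\Theta} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|}
  \log\left( P_{\Phi_0 + \Delta\Phi(\Theta)}\left(y_t \mid x, y_{<t}\right) \right)
```

The only change from full fine-tuning is the variable being maximized over: the search runs over the small $\Theta$ instead of the full $\Phi$.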

•
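Full fine-tuning (mixed precision with Adam): each parameter takes 16 bytes: 2 bytes for the half-precision weight, 2 bytes for the half-precision gradient, 4 bytes for the FP32 copy of the weight, and 4 bytes each for Adam's two moment estimates; activation memory comes on top of this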

•
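LoRA: the frozen weights still have to be stored and activations are still kept for backpropagation; the gradients, FP32 copy, and Adam states are needed only for the small set of trainable parameters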
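The accounting above can be turned into a quick back-of-the-envelope calculation. This is a sketch under stated assumptions: the 16-byte split is read as half-precision weight, half-precision gradient, FP32 weight copy, and two Adam states; activation memory is ignored; and the model size and adapter size are hypothetical numbers, not figures from the paper.

```python
# Per-parameter byte costs, following the 16-byte breakdown in the notes.
BYTES_FP16_WEIGHT = 2
BYTES_FP16_GRAD = 2        # assumption: the second 2-byte term is the gradient
BYTES_FP32_COPY = 4
BYTES_ADAM_STATES = 4 + 4  # first and second moment estimates

BYTES_PER_TRAINABLE = (
    BYTES_FP16_WEIGHT + BYTES_FP16_GRAD + BYTES_FP32_COPY + BYTES_ADAM_STATES
)

def full_finetune_bytes(n_params: int) -> int:
    """Every parameter is trainable, so every parameter pays the full cost."""
    return n_params * BYTES_PER_TRAINABLE

def lora_bytes(n_base_params: int, n_lora_params: int) -> int:
    """Frozen base weights pay only for storage; adapter weights pay the full cost."""
    return n_base_params * BYTES_FP16_WEIGHT + n_lora_params * BYTES_PER_TRAINABLE

# Hypothetical sizes: a 7B-parameter base model with 4M adapter parameters.
n_base = 7_000_000_000
n_lora = 4_000_000

full_gb = full_finetune_bytes(n_base) / 1e9
lora_gb = lora_bytes(n_base, n_lora) / 1e9

print(f"full fine-tuning: {full_gb:.1f} GB for weights, gradients, optimizer states")
print(f"LoRA:             {lora_gb:.1f} GB for the same categories")
```

Under these assumptions almost all of the LoRA footprint is the frozen half-precision weights; the optimizer-related cost of the adapter is negligible in comparison.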

Experiments

LoRA
