
【HD decoding】MiniCPM-Llama3-V 2.5

Last modified: August 26, 2024
一、Image input for ViT
Since BERT took off, the Transformer architecture has also drawn attention in the vision field, which led to the Vision Transformer (ViT).
The input image size is 224x224, and the image is divided into fixed-size patches of 16x16, so each image yields 224x224 / (16x16) = 196 patches, i.e., an input sequence of length 196. Each flattened patch has dimension 16x16x3 = 768. The linear projection layer has shape 768xN (with N = 768), so after the projection the input is still 196x768: 196 tokens, each of dimension 768. A special [CLS] token is then prepended, giving a final shape of 197x768. At this point, patch embedding has turned a vision problem into a sequence (seq2seq-style) problem.
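To make the shapes concrete, here is a minimal patch-embedding sketch, assuming PyTorch; the class name `PatchEmbed` and the random input are illustrative only, not MiniCPM-V's actual code.

```python
# A minimal sketch of ViT patch embedding (assumption: PyTorch, shapes as above).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 196
        # A conv with kernel=stride=patch_size is equivalent to cutting the image
        # into patches, flattening each to 16*16*3=768, and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1)       # (B, 197, 768)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768])
```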
二、What to do when the image resolution does not match pre-training?
The ViT example above takes a 224*224 input. Once the model is fixed, its input resolution is also fixed and cannot be changed.
1. Image scaling
The input image can be scaled to 224*224 (the target size) by interpolation; this is the common practice for ViT today. However, scaling can introduce problems: the ellipse in the left figure may become a perfect circle after scaling, which clearly carries different information from the original image.
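A minimal sketch of the scaling approach, assuming PyTorch and a made-up 336x672 input; bilinear interpolation forces the image to 224x224 and distorts the aspect ratio.

```python
# Scale an arbitrary image to the fixed pre-training resolution (assumption: PyTorch).
import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 336, 672)                       # a wide, non-square image
resized = F.interpolate(img, size=(224, 224),
                        mode="bilinear", align_corners=False)
print(resized.shape)                                    # torch.Size([1, 3, 224, 224])
# The 2:1 aspect ratio is squeezed to 1:1, so shapes are distorted (ellipse -> circle).
```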
2. Padding
Anyone who has studied convolutions knows that padding can be used when the image size does not fit.
Could we instead pad the image up to the ViT pre-training size (e.g. 448*448), so that the original aspect ratio is preserved? (In the right-hand figure below, the complete original picture is kept inside the padded canvas.) A padding sketch follows the list of drawbacks below.
Disadvantages of padding:
1. Computational efficiency: the padded region is artificial content, so the more padding there is, the lower the effective compute efficiency.
2. Accuracy: experiments show that as the padded region grows, the accuracy of the multimodal model decreases.
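For reference, a minimal sketch of the padding approach, assuming PyTorch and an illustrative 336x448 input padded up to 448x448; the printed fraction shows how much of the padded image is artificial content.

```python
# Pad an image up to the pre-training size while keeping its aspect ratio (assumption: PyTorch).
import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 336, 448)                       # H=336, W=448
target = 448
pad_h, pad_w = target - img.shape[2], target - img.shape[3]
# F.pad order for a 4D tensor is (left, right, top, bottom).
padded = F.pad(img, (0, pad_w, 0, pad_h), value=0.0)
print(padded.shape)                                     # torch.Size([1, 3, 448, 448])
# Fraction of artificial (padded) pixels, i.e. wasted compute:
print(1 - (336 * 448) / (448 * 448))                    # 0.25
```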
3. Overlapping segmentation
With the segmentation scheme shown above, a rectangular input image can be split into two squares, red and black, which are fed to the pre-trained model separately.
But there are still problems:
If the rectangular picture contains 6 circles, splitting it into the red and black crops may put 4 circles in each crop, so the model sees 8 circles in total, which easily amplifies model hallucination.
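A minimal sketch of overlapping segmentation, assuming PyTorch and an illustrative 224x336 input: the two square crops overlap in the middle, so anything in the overlap region is seen twice by the model.

```python
# Cut a rectangular image into two overlapping square crops (assumption: PyTorch).
import torch

img = torch.randn(1, 3, 224, 336)                       # square side = H = 224
side = img.shape[2]
left_crop  = img[:, :, :, :side]                        # the "red" square
right_crop = img[:, :, :, -side:]                       # the "black" square
overlap = 2 * side - img.shape[3]                       # pixels seen by both crops
print(left_crop.shape, right_crop.shape, overlap)       # (1,3,224,224), (1,3,224,224), 112
```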
三、HD image tiling (puzzle)