The input image is 224x224 and is divided into fixed-size patches of 16x16, so each image yields (224x224)/(16x16) = 196 patches; that is, the input sequence length is 196. Each patch is flattened into a vector of dimension 16x16x3 = 768. The linear projection layer has dimension 768xN (N = 768), so after the projection the input is still 196x768: there are 196 tokens in total, and each token has dimension 768. A special classification token, cls, is then prepended to the sequence, so the final dimension is 197x768. At this point, patch embedding has turned a vision problem into a sequence problem that a Transformer can process.
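The shape arithmetic above can be checked with a short sketch. This is only an illustration in plain Python, not a reference implementation: the function names `patch_embedding_shapes` and `patchify` are hypothetical, the image is a dummy all-zero nested list, and the row-major ordering of patches is an assumption. The learned projection itself is not computed; only its shape is tracked.

```python
def patch_embedding_shapes(image_size=224, patch_size=16, channels=3, embed_dim=768):
    """Trace the shapes through a ViT-style patch embedding (hypothetical helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size      # 224 // 16 = 14
    num_patches = patches_per_side ** 2              # 14 * 14 = 196
    patch_dim = patch_size * patch_size * channels   # 16 * 16 * 3 = 768
    return {
        "flattened_patches": (num_patches, patch_dim),
        "projection_weight": (patch_dim, embed_dim),
        "projected_tokens": (num_patches, embed_dim),
        "with_cls_token": (num_patches + 1, embed_dim),
    }


def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches.

    Patches are visited left to right, top to bottom (an assumed ordering).
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append this pixel's channel values
            patches.append(patch)
    return patches


shapes = patch_embedding_shapes()
for name, shape in shapes.items():
    print(f"{name}: {shape}")

# Dummy 224x224x3 image filled with zeros, used only to confirm the counts.
dummy = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(dummy, 16)
print(len(patches), len(patches[0]))  # 196 768
```

In practice, deep learning frameworks usually fuse the patch split and the linear projection into a single strided convolution; the explicit loop here is only meant to make the 196x768 and 197x768 shapes easy to verify.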