Title: TPC-NAS for ViTs: A Systematic Approach to Improve Vision Transformer Performance with Total Path Count
Authors: Pi-Chuan Chen, Tzi-Dar Chiueh
Type: conference paper
DOI: 10.1109/ijcnn60899.2024.10651183
Dates: 2024-10-21; 2024-06-30
URL: https://scholars.lib.ntu.edu.tw/handle/123456789/722250
SDGs: SDG7

Abstract: With the rapid development of AI technology, neural network models have achieved remarkable success in computer vision. In image classification, Vision Transformer (ViT) models, which integrate self-attention with the Transformer architecture, represent particularly striking progress. Automated search for ViT-based models is therefore imperative in the quest for optimal vision performance. Nevertheless, most existing Neural Architecture Search (NAS) algorithms either lack support for ViTs or demand very long search times. The TPC score and TPC-NAS [1] enable users to explore hundreds of thousands of architectures derived from a given ViT within minutes. In this paper, we enhance TPC-NAS and apply it to ViTs. First, the strong rank correlation (0.97) between the TPC score and accuracy confirms the effectiveness of the TPC score for ViTs. We then tailor TPC-NAS to the characteristics of ViTs, enhancing both the search space and the search methodology to find better architectures. On the ImageNet image classification task, the ViT model discovered by TPC-NAS attains 82.8% Top-1 accuracy at 4G FLOPs, outperforming all other NAS methods. Finally, for several well-known ViT families, the ViT models discovered by TPC-NAS outperform the corresponding baselines.
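
The abstract's claim of a 0.97 rank correlation between the TPC score and accuracy refers to how well a cheap proxy score ranks candidate architectures relative to their trained accuracy. A minimal sketch of that evaluation, using Spearman's rank correlation on hypothetical proxy scores and accuracies (the numbers below are illustrative, not from the paper):

```python
def ranks(xs):
    # Assign ranks 1..n by ascending value (no ties in this toy example)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    # Spearman's rho via the classic formula 1 - 6*sum(d^2) / (n*(n^2 - 1))
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical zero-cost proxy scores and measured Top-1 accuracies
# for four candidate architectures (values invented for illustration)
tpc_scores = [3.1, 2.4, 4.0, 1.5]
top1_accs = [76.2, 74.8, 77.0, 75.1]
print(spearman_rho(tpc_scores, top1_accs))  # → 0.8
```

A rank correlation near 1, as reported for the TPC score on ViTs, means the proxy can stand in for expensive training when comparing candidates during search.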