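Journal of Northwest University (Natural Science Edition)

2025, Vol. 55, No. 1, pp. 106-117

A Prompt Learning Method for Intangible Cultural Heritage Art Image Classification

1. National and Local Joint Engineering Research Center for Cultural Heritage Digitization, Northwest University 2. School of Information Science and Technology, Northwest University 3. College of Letters and Science, University of California, Davis 4. Engineering Research Center of Virtual Reality and Applications, Ministry of Education, Beijing Normal University

Foundation: Open Project Fund of the State Key Laboratory of Virtual Reality Technology and Systems (Beihang University) (VRLAB2024C02); Key Laboratory Project of the Ministry of Culture and Tourism (1222000812, cr2021K01); Xi'an Science and Technology Plan, Social Development Science and Technology Innovation Demonstration Project (2024JH-CXSF-0014); National Natural Science Foundation of China (62271393)

Email: liuxinda@nwu.edu.cn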

DOI: 10.16152/j.cnki.xdxbzr.2025-01-009

Release date: 2025-01-20

Publication date: 2025-01-20

Online publication date: 2025-01-20


Abstract:

To address long processing times, low efficiency, and high data complexity in the classification of Chinese intangible cultural heritage (ICH) artworks, this paper proposes a context prompt tuning strategy built on a pre-trained vision-language model, aimed at improving few-shot classification performance. The approach introduces learnable context-optimization prompts (soft prompts) that let the model adapt quickly to downstream classification tasks from a small number of samples, which shortens training time and speeds up convergence. Specifically, an attention mechanism combines the text features generated from the soft prompts with the original features of the pre-trained vision-language model, and a contrastive loss optimizes the embedded representations. This reduces the embedding discrepancy between the two kinds of features, keeps the model from overfitting to the seen (base) categories, and improves generalization to unseen classes. Retaining the original feature information also helps the model avoid forgetting its base knowledge during training, so it maintains high classification accuracy even under few-shot conditions. Experimental results show that, on the ICH art image classification task, the proposed method improves classification accuracy by 1.79% and generalization to unseen classes by 10.4%, while keeping computational cost low.

Keywords: intangible cultural heritage; image classification; context optimization; attention mechanism
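The abstract does not give the exact formulation, so the following is only a minimal forward-pass sketch of the idea it describes, under stated assumptions: per-class text features from soft prompts are fused with frozen (original) text features through scaled dot-product attention, images are classified with a cross-entropy over cosine similarities, and an alignment term penalizes the gap between fused and frozen features. All function names, the residual form of the fusion, the temperature of 0.07, and the use of a `1 - cosine` alignment term are illustrative assumptions, not the authors' implementation. The feature vectors are random stand-ins for encoder outputs, and no training step is shown.

```python
import math
import random

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_with_attention(soft_feats, frozen_feats):
    """Each soft-prompt feature (query) attends over the frozen text features
    (keys and values); a residual add keeps the learned feature (assumption)."""
    dim = len(soft_feats[0])
    fused = []
    for q in soft_feats:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in frozen_feats])
        attended = [sum(w * v[i] for w, v in zip(weights, frozen_feats))
                    for i in range(dim)]
        fused.append(l2_normalize([a + b for a, b in zip(q, attended)]))
    return fused

def classification_loss(image_feat, text_feats, label, temperature=0.07):
    """Cross-entropy over temperature-scaled cosine similarities.
    Inputs are assumed to be unit vectors, so dot product equals cosine."""
    probs = softmax([dot(image_feat, t) / temperature for t in text_feats])
    return -math.log(probs[label]), probs

def alignment_loss(fused_feats, frozen_feats):
    """Mean (1 - cosine) gap between fused and frozen features, class by class."""
    gaps = [1.0 - dot(f, z) for f, z in zip(fused_feats, frozen_feats)]
    return sum(gaps) / len(gaps)

# Toy data: random unit vectors stand in for encoder outputs (hypothetical).
random.seed(0)
num_classes, dim = 3, 8
frozen = [l2_normalize([random.gauss(0, 1) for _ in range(dim)])
          for _ in range(num_classes)]
soft = [l2_normalize([x + 0.3 * random.gauss(0, 1) for x in f]) for f in frozen]
image = l2_normalize([x + 0.1 * random.gauss(0, 1) for x in frozen[1]])

fused = fuse_with_attention(soft, frozen)
ce_loss, probs = classification_loss(image, fused, label=1)
total_loss = ce_loss + 1.0 * alignment_loss(fused, frozen)  # weight 1.0 is arbitrary
print(f"cross-entropy={ce_loss:.4f}, total={total_loss:.4f}")
```

In a real setting the soft-prompt features would come from a text encoder applied to learnable context vectors, and only those vectors (plus any attention parameters) would be updated by gradient descent while the pre-trained encoders stay frozen.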

For the full text, visit cnki.net.


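Basic information:

CLC number: J05; TP391.41

Citation:

[1] ZHANG Qinyu (张秦瑜), LIU Xinda (刘鑫达), LU Zhuoming (鲁倬铭), et al. A prompt learning method for intangible cultural heritage art image classification[J]. Journal of Northwest University (Natural Science Edition), 2025, 55(01): 106-117. DOI: 10.16152/j.cnki.xdxbzr.2025-01-009.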
