A Virtual Try-On Method based on Enhanced Feature Representation and Global Attention

Authors

  • Yuanyuan Li

DOI:

https://doi.org/10.6919/ICJE.202603_12(3).0044

Keywords:

Virtual Try-On; Garment Warping; Global Attention; Image Synthesis.

Abstract

Virtual try-on (VTON) synthesizes realistic images by mapping a target garment onto a person while preserving structural alignment and texture fidelity. Existing methods often struggle with complex garment deformations and fine-grained details, producing artifacts such as distortions and texture loss. To address these challenges, we propose EAG-VTON, a framework that combines enhanced feature representation with global attention. Specifically, we introduce an Enhanced Appearance Flow Warping Module (EAFWM) that integrates pre-activation residual blocks and an enhanced spatially-adaptive normalization (E-SPADE) to improve garment deformation accuracy. For image synthesis, a Residual Generator with Global Attention (RGC) combines ResNetV2 blocks with a Global Grouped Coordinate Attention (GGCA) module to capture long-range dependencies and preserve structural consistency. Experiments on the VITON-HD dataset show that EAG-VTON outperforms state-of-the-art baselines on SSIM, LPIPS, and FID, demonstrating superior structural fidelity and realistic texture reconstruction.
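The abstract describes the GGCA module only at a high level; its exact grouping and attention design are given in the paper itself. As a rough orientation, a minimal coordinate-attention gate of the general kind such modules build on can be sketched in NumPy. The weight matrices `w_h` and `w_w` are hypothetical stand-ins for the learned 1x1 convolutions; this is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coord_attention(x, w_h, w_w):
    """Simplified coordinate-attention gate on a single feature map.

    x:   (C, H, W) feature map
    w_h: (C, C) hypothetical weights for the height-wise branch
    w_w: (C, C) hypothetical weights for the width-wise branch
    """
    # Directional pooling: average over width and over height separately,
    # so each branch retains positional information along one axis.
    pooled_h = x.mean(axis=2)            # (C, H)
    pooled_w = x.mean(axis=1)            # (C, W)
    a_h = sigmoid(w_h @ pooled_h)        # (C, H) height attention in (0, 1)
    a_w = sigmoid(w_w @ pooled_w)        # (C, W) width attention in (0, 1)
    # Broadcast the two 1-D attention maps back over the full feature map.
    return x * a_h[:, :, None] * a_w[:, None, :]
```

A grouped variant, as the GGCA name suggests, would split the C channels into groups and apply such a gate per group before recombining; the two axis-wise branches are what let the gate encode long-range position information cheaply.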
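Of the three reported metrics, SSIM is the most self-contained. A minimal single-window (global) SSIM over a pair of grayscale images, using the standard constants from the original SSIM definition, can be sketched as follows; practical evaluations typically use a sliding-window version instead.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Global (single-window) SSIM between two grayscale images.

    x, y: equally shaped float arrays with values in [0, data_range].
    Returns a similarity score in (-1, 1], where 1 means identical.
    """
    # Standard stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

LPIPS and FID, by contrast, require pretrained networks (AlexNet/VGG features for LPIPS, Inception statistics for FID) and are not reproducible in a few lines.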

Published

2026-03-19

Section

Articles

How to Cite

Li, Y. (2026). A Virtual Try-On Method based on Enhanced Feature Representation and Global Attention. International Core Journal of Engineering, 12(3), 386-396. https://doi.org/10.6919/ICJE.202603_12(3).0044