Channel and Spatial Enhancement Network for human parsing
Kunliang Liu, Rize Jin, Yuelong Li, Jianming Wang, Wonjun Hwang
Image and Vision Computing (IF 4.2)
Abstract

The dominant backbones of neural networks for scene parsing consist of multiple stages, where feature maps at different stages carry varying levels of spatial and semantic information. High-level features convey more semantics but fewer spatial details, while low-level features possess fewer semantics but more spatial details. Consequently, semantic-spatial gaps exist among features at different levels, particularly in human parsing tasks. Many existing approaches directly upsample multi-stage features and aggregate them through addition or concatenation without addressing these gaps. This inevitably leads to spatial misalignment, semantic mismatch, and ultimately misclassification in parsing, especially for human parsing, which demands richer semantic information and finer feature-map detail owing to intricate textures, diverse clothing styles, and heavy scale variability across different human parts. In this paper, we effectively alleviate the long-standing challenge of semantic-spatial gaps between features from different stages by innovatively using subtraction and addition operations to recognize the semantic and spatial differences and compensate for them. Based on these principles, we propose the Channel and Spatial Enhancement Network (CSENet) for parsing, a straightforward and intuitive solution that injects high-level semantic information into lower-stage features and, conversely, introduces fine details into higher-stage features. Extensive experiments on three dense prediction tasks demonstrate the efficacy of our method. Specifically, it achieves the best performance on the LIP and CIHP datasets, and we also verify its generality on the ADE20K dataset.
• We propose CSENet, which effectively addresses the challenge of semantic and spatial gaps between feature maps from different stages in human parsing. By using subtraction and addition to compute and compensate for feature differences, CSENet reduces the semantic gaps, successfully introducing high-level semantic information to low-level features and fine details to high-level features, which benefits the recognition of large objects and inconspicuous parts, especially in the context of human parsing.
• We introduce CEM and SEM as the main components of CSENet. CEM employs average pooling, subtraction, and addition to compute and compensate for semantic differences, while SEM uses similar operations to compute and compensate for spatial differences. These modules strengthen the discriminative ability of feature representations, improving the recognition of fine details, inner patterns, and accurate spatial locations of human parts.
• Our CSENet is shown to be effective and efficient in improving the performance of existing backbones. The modules are general and can be easily integrated into existing architectures, enabling the effective assembly of feature maps from deep to shallow layers. Experimental results demonstrate the efficacy of CSENet: our method achieves state-of-the-art performance on the LIP and CIHP datasets without using pose information or the class hierarchy of the scene. We also validate its generality by using a transformer backbone on the scene parsing dataset ADE20K.
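The subtract-then-add compensation idea described above can be sketched in a few lines. The following NumPy sketch is purely illustrative and is not the paper's actual CEM/SEM implementation: the function names, tensor shapes, channel-descriptor pooling, and nearest-neighbour upsampling are all assumptions made here to convey the principle of computing a difference between stages and adding it back as compensation.

```python
import numpy as np

def channel_enhance(low, high):
    """Illustrative channel-wise (semantic) compensation.

    low:  (C, H, W) low-level feature map (more spatial detail)
    high: (C, h, w) high-level feature map (more semantics)
    """
    # Global average pooling yields per-channel semantic descriptors.
    low_desc = low.mean(axis=(1, 2))    # shape (C,)
    high_desc = high.mean(axis=(1, 2))  # shape (C,)
    # Subtraction recognizes the per-channel semantic difference...
    diff = high_desc - low_desc
    # ...and addition injects it back into the low-level features.
    return low + diff[:, None, None]

def spatial_enhance(low, high):
    """Illustrative spatial-detail compensation.

    Upsamples `high` to `low`'s resolution by nearest-neighbour repeat
    (assumes integer scale factors), then adds the per-position spatial
    difference of the low-level map back into it.
    """
    sh = low.shape[1] // high.shape[1]
    sw = low.shape[2] // high.shape[2]
    up = high.repeat(sh, axis=1).repeat(sw, axis=2)  # shape (C, H, W)
    # Channel-averaged spatial maps indicate where fine detail differs.
    diff = low.mean(axis=0, keepdims=True) - up.mean(axis=0, keepdims=True)
    return up + diff
```

In both directions the pattern is the same: a pooled summary of one stage is subtracted from the other to expose the gap, and that difference is added back so the weaker stage is compensated rather than simply concatenated.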

Keywords
Parsing · Channel (broadcasting) · Computer science · Artificial intelligence · Natural language processing · Computer network
Type
article
IF / Citations
4.2 / 0
Year published
2024
