Scene representation in autonomous driving relies heavily on extracting meaningful features from images and accurately mapping them to 3D world coordinates. Traditional backbones, such as ImageNet-pretrained ResNets, provide a robust foundation for feature extraction but are increasingly seen as limited for 3D alignment: classification pretraining rewards invariance rather than the dense spatial detail that mapping features into world coordinates requires. This paper explores the integration of advanced segmentation models as backbones, focusing on how feature quality at the extraction stage directly affects downstream scene representation tasks. Preliminary experiments demonstrate the potential for improved feature alignment and semantic consistency, highlighting the importance of backbone design in modern 3D perception pipelines.
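To make the backbone contrast concrete, the sketch below compares the feature maps produced by an ImageNet-pretrained ResNet-50 with those of a segmentation model's encoder. It assumes a PyTorch/torchvision pipeline; the specific models (resnet50, deeplabv3_resnet50) and input size are illustrative choices, not the configuration evaluated in the paper.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.segmentation import (
    deeplabv3_resnet50,
    DeepLabV3_ResNet50_Weights,
)

# ImageNet-pretrained ResNet-50 used as a plain feature backbone:
# drop the average pool and classification head, keep conv stages.
resnet = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
resnet_backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

# Segmentation model whose encoder doubles as a backbone; its features
# were trained with dense per-pixel supervision rather than image labels.
seg_model = deeplabv3_resnet50(
    weights=DeepLabV3_ResNet50_Weights.COCO_WITH_VOC_LABELS_V1
)
seg_backbone = seg_model.backbone  # ResNet-50 with dilated final stages

x = torch.randn(1, 3, 512, 512)  # dummy camera image
with torch.no_grad():
    f_cls = resnet_backbone(x)      # [1, 2048, 16, 16]: stride-32 grid
    f_seg = seg_backbone(x)["out"]  # [1, 2048, 64, 64]: stride-8 grid

print(f_cls.shape, f_seg.shape)
```

Note the design point this illustrates: the segmentation encoder preserves a stride-8 feature grid via dilated convolutions, a 4x denser spatial resolution than the classification backbone, which is exactly the kind of spatial detail that helps when lifting per-pixel features into 3D world coordinates.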