Automatic pain assessment aims to accurately recognize and quantify pain from data modalities such as video and physiological signals. Traditional pain assessment methods rely on subjective self-reporting, which limits their objectivity, consistency, and overall effectiveness in clinical settings. Machine learning offers a promising alternative, yet many existing approaches rely on a single data modality, which may not adequately capture the multifaceted nature of pain-related responses. Multimodal approaches, in contrast, can provide a more comprehensive understanding by integrating diverse sources of information. To address this, we propose a dual-stream framework for classifying physiological and behavioral correlates of pain that leverages multimodal data to enhance robustness and adaptability across diverse clinical scenarios. Our framework begins with masked autoencoder pre-training for each modality (facial video and multivariate bio-psychological signals), compressing the raw temporal input into meaningful representations that capture complex patterns in high-dimensional data. In the second stage, the complete classifier combines a dual hybrid positional-encoding embedding with cross-attention fusion. Evaluations on the AI4Pain and BioVid datasets demonstrate our model's superior performance in both electrode-based and heat-induced pain settings.
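As a rough illustration of the cross-attention fusion stage described above, the sketch below lets each modality's token sequence attend to the other before pooling into a joint feature. This is a minimal single-head sketch, not the paper's implementation: the token counts, embedding dimension, shared projection weights, and mean-pooling fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head cross-attention: one modality's tokens query the other's."""
    Q = queries @ Wq            # (T_q, d)
    K = keys_values @ Wk        # (T_kv, d)
    V = keys_values @ Wv        # (T_kv, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot-product scores
    return softmax(scores, axis=-1) @ V       # (T_q, d)

rng = np.random.default_rng(0)
d = 16  # hypothetical embedding dimension
# Stand-ins for the pre-trained encoder outputs of each stream:
video_tokens = rng.standard_normal((10, d))   # facial-video embeddings
physio_tokens = rng.standard_normal((8, d))   # bio-physiological embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

# Each stream attends to the other; the fused vector concatenates pooled outputs.
v2p = cross_attention(video_tokens, physio_tokens, Wq, Wk, Wv)
p2v = cross_attention(physio_tokens, video_tokens, Wq, Wk, Wv)
fused = np.concatenate([v2p.mean(axis=0), p2v.mean(axis=0)])
print(fused.shape)  # (32,)
```

In a full model this fused vector would feed a classification head; the bidirectional query/key roles are what distinguish cross-attention fusion from simple feature concatenation.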