This research advances the field of human activity recognition (HAR) by developing a robust and interpretable deep learning model from wearable sensor data. We recognize seven discrete activities with a multimodal fusion architecture that combines temporal convolutional networks (TCNs), convolutional neural networks (CNNs), and long short-term memory (LSTM) networks, exploiting each network type's strength: TCNs capture long-range temporal dependencies, CNNs extract local features, and LSTMs model sequential information. A dedicated fusion layer integrates these three feature streams. Fivefold cross-validation on challenging data yields a mean accuracy of 98.7% with a standard deviation of 0.003. In addition, we apply local interpretable model-agnostic explanations (LIME) and Shapley additive explanations (SHAP) to offer insights into the model's decision-making process, thereby improving its transparency and fostering confidence in its predictions. This study thus contributes robust and interpretable deep learning models that can be used in a variety of HAR applications.
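To make the three-branch fusion design concrete, the sketch below shows one plausible PyTorch realization: a dilated-convolution TCN branch, a small CNN branch, and an LSTM branch whose features are concatenated by a fusion layer feeding a seven-class head. All layer sizes, the nine-channel input, and the 128-step window are illustrative assumptions, not the exact configuration reported in this work.

```python
# Hypothetical sketch of a TCN + CNN + LSTM fusion model for HAR.
# Channel counts, hidden sizes, and window length are assumptions.
import torch
import torch.nn as nn

class FusionHAR(nn.Module):
    def __init__(self, n_channels=9, n_classes=7, hidden=64):
        super().__init__()
        # TCN branch: dilated convolutions for long-range temporal dependencies
        self.tcn = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(),
        )
        # CNN branch: small receptive field for local features
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # LSTM branch: sequential information across the window
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        # Fusion layer: concatenated branch features -> class logits
        self.head = nn.Linear(3 * hidden, n_classes)

    def forward(self, x):            # x: (batch, time, channels)
        xc = x.transpose(1, 2)       # (batch, channels, time) for conv branches
        t = self.tcn(xc).mean(dim=2)       # global average pool over time
        c = self.cnn(xc).mean(dim=2)
        _, (h, _) = self.lstm(x)           # final hidden state
        fused = torch.cat([t, c, h[-1]], dim=1)
        return self.head(fused)

model = FusionHAR()
# 4 sliding windows, 128 timesteps, 9 sensor channels (assumed shape)
logits = model(torch.randn(4, 128, 9))
print(tuple(logits.shape))  # (4, 7): one logit per activity class
```

Global average pooling over time in the convolutional branches and taking the LSTM's final hidden state reduce each branch to a fixed-size vector, so a single linear fusion layer suffices regardless of window length.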