Accurate trajectory estimation from low-cost inertial measurement unit (IMU) data remains challenging due to sensor noise, bias, and drift. In this paper, we propose the Convolutional IMU Transformer (CoIT), a deep neural network that combines Convolutional Token Embedding with a hierarchical Transformer architecture to capture both local motion features and global temporal dependencies from raw IMU measurements. The network is designed to overcome the limitations of traditional dead reckoning and of prior learning-based approaches that rely heavily on inaccurate orientation estimates. CoIT further incorporates a multi-stage Convolutional Projection and downsampling strategy to improve computational efficiency while preserving representational capacity. We evaluate CoIT on three large-scale public datasets (RIDI, RoNIN, and NeurIT) and demonstrate consistent gains over existing state-of-the-art models. On the RoNIN dataset, CoIT achieves an absolute trajectory error (ATE) of 3.39 ± 0.59 m, a time-normalized relative trajectory error (T-RTE) of 2.63 ± 0.57 m, and a distance-normalized relative trajectory error (D-RTE) of 0.17 ± 0.02 m; it also generalizes robustly to unseen subjects and transfers competitively from pedestrian motion (RoNIN) to indoor robot motion (NeurIT) without fine-tuning. Furthermore, an ablation study on model complexity reveals a clear trade-off between accuracy and latency, indicating the architecture’s scalability for practical deployment. These results underscore the robustness and statistical reliability of the proposed model across the evaluated public benchmarks in GPS-denied settings.
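For reference, the trajectory metrics reported above are standard in inertial-navigation evaluation: ATE is the root-mean-square of the per-sample position error over the full trajectory, while the relative errors measure drift over fixed time or distance windows. A minimal pure-Python sketch of the ATE computation is shown below; the toy trajectories are illustrative assumptions, and benchmark implementations such as RoNIN's additionally align the estimated trajectory to the ground truth before computing the error.

```python
import math

def ate(est, gt):
    """Absolute trajectory error: RMSE of the per-sample 2D position
    error between an estimated and a ground-truth trajectory.
    `est` and `gt` are equal-length lists of (x, y) positions in
    meters, assumed already aligned with each other."""
    assert len(est) == len(gt) and len(gt) > 0
    sq_err = [(ex - gx) ** 2 + (ey - gy) ** 2
              for (ex, ey), (gx, gy) in zip(est, gt)]
    return math.sqrt(sum(sq_err) / len(sq_err))

# Toy example: the estimate drifts by a constant 3 m in y.
gt  = [(float(i), 0.0) for i in range(5)]
est = [(float(i), 3.0) for i in range(5)]
print(ate(est, gt))  # constant 3 m offset -> ATE = 3.0
```

T-RTE and D-RTE follow the same error computation but restrict it to sliding windows of fixed duration (e.g. one minute) or fixed traveled distance, which isolates drift rate from accumulated error.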