Speech recognition and time-series data processing are crucial for applications such as human–computer interaction, assistive technologies, and biometric authentication. However, traditional methods often struggle with noisy data, speaker variability, and high-dimensional feature spaces, which limit their accuracy and interpretability. This study proposes a novel framework that integrates Dynamic Time Warping (DTW) with Multidimensional Scaling (MDS) to improve the visualization and analysis of speech time-series data. The framework consists of four stages: data preparation, preprocessing, DTW distance calculation, and two-dimensional (2D) vector space mapping. Lip regions were extracted from video frames and represented using raw grayscale images, lip-shaped approximations, and hybrid features. DTW was applied to measure temporal similarity, followed by MDS to project the data into a lower-dimensional space, yielding a clearer feature distribution and more efficient Cluster Validity Index (CVI) computation. The experimental results show that the proposed approach enhances recognition performance. Among the features tested, V_Δgray achieved the highest speaker-dependent recognition rate of 94.96% (±0.0364 standard deviation), whereas V_shape yielded the best speaker-independent recognition rate of 50.91% (±0.2596 standard deviation). Syllable- and word-level analyses further confirmed the robustness of V_shape. In conclusion, the DTW–MDS framework improves class separability and interpretability and offers a reliable and efficient method for time-series speech analysis. These findings have significant implications for mobile and wearable speech recognition systems.
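The core DTW-then-MDS stage can be sketched in a few lines. The following is a minimal illustration, not the paper's exact configuration: it uses toy 1-D sequences in place of lip features, a standard textbook DTW recursion, and classical (eigendecomposition-based) MDS; all function names and the example data are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classical_mds(dist, k=2):
    """Embed a symmetric distance matrix into k dimensions (classical MDS)."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]         # top-k eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Toy "utterances": three time-warped sines plus one dissimilar ramp.
seqs = [np.sin(np.linspace(0, 2 * np.pi, t)) for t in (40, 50, 60)]
seqs.append(np.linspace(-1, 1, 45))

n = len(seqs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dtw_distance(seqs[i], seqs[j])

coords = classical_mds(D)  # n x 2 layout for visualization / CVI computation
```

Because DTW handles the differing sequence lengths, the three sine variants end up mutually close in the 2D embedding while the ramp separates, which is the class-separability effect the framework exploits.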