Human vocal signals are essential for information exchange. Recent research has explored capturing vocal signals not only with microphones but also with radar. Mode decomposition methods are a representative approach for enhancing radar-based vocal signals, but their performance depends critically on manually selected parameters. This paper proposes a novel framework, composite mode fitness score-successive variational mode decomposition (CMFS-SVMD), to overcome this limitation. We use a 77 GHz frequency-modulated continuous-wave (FMCW) multiple-input multiple-output (MIMO) radar and the curve length (CL) method for human localization. The core of our work is CMFS-SVMD, which automatically selects the balancing parameter for SVMD by minimizing a novel fitness score tailored to vocal signal characteristics. The proposed algorithm is validated by comparing the extracted fundamental frequency F<sub>0</sub> against a ground truth derived from a synchronized microphone, using the root mean square error (RMSE) as the primary metric. Experimental results show that the proposed algorithm accurately tracks the F<sub>0</sub> of various utterances, including words, sentences, and sustained vowels, demonstrating its robustness and adaptability.
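
The two quantitative steps named in the abstract, choosing a parameter by minimizing a score and scoring an F0 track by RMSE, can be illustrated with a minimal sketch. The function names `select_parameter` and `f0_rmse`, the exhaustive search over a candidate grid, and the convention that unvoiced frames are marked as NaN and skipped are all assumptions made for illustration. The abstract does not define the fitness score, the search strategy, or how the radar and microphone tracks are aligned, so the score is passed in as an arbitrary callable.

```python
import math


def select_parameter(candidates, score_fn):
    """Return the candidate with the smallest score.

    Assumption: a plain exhaustive search over a user-supplied grid.
    `score_fn` is a hypothetical stand-in for a fitness score; the
    actual score used by the paper is not given in the abstract.
    """
    return min(candidates, key=score_fn)


def f0_rmse(f0_radar, f0_mic):
    """RMSE in Hz between two frame-aligned F0 tracks.

    Assumption: both tracks have one value per frame, and frames with
    no F0 estimate are NaN; such frames are excluded from the error.
    """
    if len(f0_radar) != len(f0_mic):
        raise ValueError("F0 tracks must have the same number of frames")
    squared_errors = [
        (r - m) ** 2
        for r, m in zip(f0_radar, f0_mic)
        if not (math.isnan(r) or math.isnan(m))
    ]
    if not squared_errors:
        raise ValueError("no frames with a valid F0 in both tracks")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Toy score with its minimum at 500, used only to exercise the search.
    best = select_parameter([100, 500, 2000], lambda a: (a - 500) ** 2)
    print("selected parameter:", best)

    # Toy F0 tracks in Hz; the last radar frame has no estimate.
    radar = [100.0, 110.0, float("nan")]
    mic = [103.0, 106.0, 120.0]
    print("RMSE (Hz):", f0_rmse(radar, mic))
```

In practice the score would be computed from the modes produced by the decomposition for each candidate value, and the reference track would come from a pitch estimator run on the synchronized microphone recording.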