article | gold open access · Citations: 3 · 2023
Using Multiple Monolingual Models for Efficiently Embedding Korean and English Conversational Sentences
Youngki Park, Youhyun Shin
IF 2.5 · Applied Sciences
Abstract

This paper presents a novel approach for finding the most semantically similar conversational sentences in Korean and English. Our method involves training separate embedding models for each language and using a hybrid algorithm that selects the appropriate model based on the language of the query. For the Korean model, we fine-tuned the KLUE-RoBERTa-small model using publicly available semantic textual similarity datasets and used Principal Component Analysis (PCA) to reduce the resulting embedding vectors. We also selected a highly-performing English embedding model from available SBERT models. We compared our approach to existing multilingual models using both human-generated and large language model-generated conversational datasets. Our experimental results demonstrate that our hybrid approach outperforms state-of-the-art multilingual models in terms of accuracy, elapsed time for sentence embedding, and elapsed time for finding the nearest neighbor, regardless of whether a GPU is used. These findings highlight the potential benefits of training separate embedding models for different languages, particularly for tasks involving finding the most semantically similar conversational sentences. We expect that our approach will be used for diverse natural language processing-related fields, including machine learning education.
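The core idea of the hybrid approach (route each query to the monolingual model matching its language, then find the nearest neighbor by similarity) can be sketched as follows. This is an illustrative sketch only: the placeholder embedding functions stand in for the fine-tuned KLUE-RoBERTa-small encoder (with its PCA-reduced vectors) and the selected SBERT English encoder, and all function names and vector shapes are assumptions, not taken from the paper.

```python
import math

def is_korean(text: str) -> bool:
    # Route by script: Hangul syllables occupy U+AC00..U+D7A3.
    return any('\uac00' <= ch <= '\ud7a3' for ch in text)

def embed_korean(text: str) -> list[float]:
    # Placeholder for the fine-tuned KLUE-RoBERTa-small encoder
    # followed by PCA dimensionality reduction (illustrative only).
    chars = text[:4]
    return [float(ord(c) % 7) for c in chars] + [0.0] * (4 - len(chars))

def embed_english(text: str) -> list[float]:
    # Placeholder for the selected SBERT English encoder (illustrative only).
    chars = text[:4]
    return [float(ord(c) % 5) for c in chars] + [0.0] * (4 - len(chars))

def embed(text: str) -> list[float]:
    # Hybrid step: select the monolingual model by the query's language.
    return embed_korean(text) if is_korean(text) else embed_english(text)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query: str, corpus: list[str]) -> str:
    # Exact nearest-neighbor search over the corpus by cosine similarity.
    q = embed(query)
    return max(corpus, key=lambda s: cosine(q, embed(s)))
```

In the paper's setting the embedders would produce dense sentence vectors and the search could be accelerated (e.g. on GPU); the routing and nearest-neighbor logic stays the same.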

Keywords
Computer science · Embedding · Natural language processing · Artificial intelligence · Sentence · Similarity (geometry)
Type
article
IF / Citations
2.5 / 3
Publication year
2023