article | gold open access · Citations: 3 · 2023
Using Multiple Monolingual Models for Efficiently Embedding Korean and English Conversational Sentences
Youngki Park, Youhyun Shin
IF 2.5 · Applied Sciences
Abstract

This paper presents a novel approach for finding the most semantically similar conversational sentences in Korean and English. Our method involves training separate embedding models for each language and using a hybrid algorithm that selects the appropriate model based on the language of the query. For the Korean model, we fine-tuned the KLUE-RoBERTa-small model using publicly available semantic textual similarity datasets and used Principal Component Analysis (PCA) to reduce the resulting embedding vectors. We also selected a highly-performing English embedding model from available SBERT models. We compared our approach to existing multilingual models using both human-generated and large language model-generated conversational datasets. Our experimental results demonstrate that our hybrid approach outperforms state-of-the-art multilingual models in terms of accuracy, elapsed time for sentence embedding, and elapsed time for finding the nearest neighbor, regardless of whether a GPU is used. These findings highlight the potential benefits of training separate embedding models for different languages, particularly for tasks involving finding the most semantically similar conversational sentences. We expect that our approach will be used for diverse natural language processing-related fields, including machine learning education.
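The core idea of the hybrid approach (route each query to the monolingual model matching its language, then find the nearest neighbor by similarity) can be sketched as follows. This is an illustrative sketch only: the placeholder embedding functions stand in for the fine-tuned KLUE-RoBERTa-small encoder (with its PCA-reduced vectors) and the selected SBERT English encoder, and all function names and vector shapes are assumptions, not taken from the paper.

```python
import math

def is_korean(text: str) -> bool:
    # Route by script: Hangul syllables occupy U+AC00..U+D7A3.
    return any('\uac00' <= ch <= '\ud7a3' for ch in text)

def embed_korean(text: str) -> list[float]:
    # Placeholder for the fine-tuned KLUE-RoBERTa-small encoder
    # followed by PCA dimensionality reduction (illustrative only).
    chars = text[:4]
    return [float(ord(c) % 7) for c in chars] + [0.0] * (4 - len(chars))

def embed_english(text: str) -> list[float]:
    # Placeholder for the selected SBERT English encoder (illustrative only).
    chars = text[:4]
    return [float(ord(c) % 5) for c in chars] + [0.0] * (4 - len(chars))

def embed(text: str) -> list[float]:
    # Hybrid step: select the monolingual model by the query's language.
    return embed_korean(text) if is_korean(text) else embed_english(text)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query: str, corpus: list[str]) -> str:
    # Exact nearest-neighbor search over the corpus by cosine similarity.
    q = embed(query)
    return max(corpus, key=lambda s: cosine(q, embed(s)))
```

In the paper's setting the embedders would produce dense sentence vectors and the search could be accelerated (e.g. on GPU); the routing and nearest-neighbor logic stays the same.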

Keywords
Computer science · Embedding · Natural language processing · Artificial intelligence · Sentence · Similarity (geometry)
Type
article
IF / Citations
2.5 / 3
Publication year
2023