Automatic Clustering and Division of Chinese Dialects and Related Computational Methods
JIANG Di
History
Published
2022-03-25
Issue Date
2022-05-12
Abstract
This paper reviews three methods for measuring the relationships between Chinese dialects: feature statistics, etymological statistics, and lexical similarity measures. It points out that all three are non-holistic methods of examination, each constrained to either the phonetic or the lexical level. The paper then expounds a more widely applicable calculation model, the Levenshtein distance algorithm (also called edit distance), which integrates and coordinates phonological similarity and lexical correspondence over the linear strings of languages or dialects, and which implicitly incorporates feature comparison and etymological probability. The automatic dialect classification experiments in this paper cover 78 dialects from eight groups in southern China (Wu, Min, Yue, Xiang, Ke, Gan, Hui and Huai) and 108 dialects from eight divisions of Mandarin (Dongbei, Beijing, Ji-Lu, Jiao-Liao, Zhongyuan, Lan-Yin, Xinan and Jin), for a total of 186 Chinese dialects. Swadesh's 100 basic words were collected for each dialect, and similarity was calculated between every pair of dialects. The results are broadly consistent with the traditional classification, but more precise.
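As an illustration of the kind of calculation the abstract describes, the following is a minimal sketch of Levenshtein distance applied to aligned word lists. It is not the paper's implementation: the function names, the normalization by the longer string's length, and the simple averaging over word pairs are all assumptions made for this example, and the sample strings are invented placeholders rather than real dialect transcriptions.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the first i-1 symbols of a
    # and the first j symbols of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer form (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def dialect_distance(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Mean normalized distance over word lists aligned by meaning (assumed aggregation)."""
    pairs = list(zip(words_a, words_b))
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical, invented forms for two concepts in two unnamed dialects.
dialect_x = ["san", "ma"]
dialect_y = ["sam", "ma"]
print(dialect_distance(dialect_x, dialect_y))
```

Computing `dialect_distance` for every pair of dialects would give a symmetric distance matrix, which could then be passed to a standard hierarchical clustering routine to produce a classification tree. Plain character strings are used here for brevity; lists of phonetic segments would work the same way, since the functions accept any sequence.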