Volume 1, Issue 1, October 2014

A NEW METHOD FOR CLASSIFYING CHINESE TEXT BASED ON SEMANTIC TOPICS AND DENSITY PEAKS

Authors: Yewang Chen and Jixiang Du
Pages: [35] - [54]
Received Date: August 11, 2014
Submitted by: Jianqiang Gao.

Abstract

This paper presents a new classification method for Chinese texts based on semantic topics and density peaks. The main motivation of this work is that most existing text classification methods fail to deal with Chinese Web texts, because these texts are sparse and irregular. The proposed method uses the real semantics of a text as its features; these semantics should be stable and abstract, expressing the high-level semantics behind the text, so that they can be used to differentiate text categories. First, BaiduBaike is used to extract semantic topics from a text as its real semantics. Second, a clustering method is applied to find the density peaks of each text category. Finally, the text is classified according to the distances between the text and the density peaks. The method handles short Chinese Web texts well with little training data. The experiments conducted show that the method is promising, especially when training data is insufficient or when processing Chinese Web texts.
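
The second and third steps can be illustrated with a minimal sketch. It assumes that each text has already been mapped to a numeric topic vector (the BaiduBaike extraction step is not shown), that density is measured by counting neighbours within a cutoff distance `d_c`, and that Euclidean distance is used. The function names, the cutoff value, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def density_peaks(points, d_c, n_peaks=1):
    """Return the n_peaks points with the largest rho * delta score.

    rho   = number of other points closer than the cutoff d_c (assumed density)
    delta = distance to the nearest point with strictly higher rho; for a
            point with no denser neighbour, its largest distance to any point
    """
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    scores = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta = min(denser) if denser else max(dist[i])
        scores.append((rho[i] * delta, i))
    scores.sort(reverse=True)
    return [points[i] for _, i in scores[:n_peaks]]

def classify(vector, peaks_by_category):
    """Assign the category whose nearest density peak is closest to vector."""
    candidates = [(euclidean(vector, peak), category)
                  for category, peaks in peaks_by_category.items()
                  for peak in peaks]
    return min(candidates)[1]

# Toy two-dimensional topic vectors (hypothetical data); the last point in
# each category is an outlier far from the dense region.
training = {
    "sports": [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15), (0.5, 0.5)],
    "finance": [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.45, 0.55)],
}
peaks = {category: density_peaks(vectors, d_c=0.2)
         for category, vectors in training.items()}

label_a = classify((0.7, 0.3), peaks)
label_b = classify((0.3, 0.7), peaks)
```

Because each category is summarised by a few peaks rather than by all of its training vectors, a sketch of this kind needs only a handful of labelled texts per category, which matches the small-training-set setting the abstract targets.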

Keywords

semantic topic, BaiduBaike, density peaks, Chinese phrase.