Project Repository

https://github.com/Embedding/Chinese-Word-Vectors

The Chinese-Word-Vectors project provides more than 100 sets of pretrained Chinese word vectors, covering different representations (dense SGNS and sparse PPMI), different context features (word, n-gram, character, and more), and different training corpora. Getting pretrained vectors is very convenient: download a file and use it directly in downstream tasks.


References

If you use the word vectors or the CA8 dataset from this project, please cite:

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.

@InProceedings{P18-2023,
  author =  "Li, Shen
    and Zhao, Zhe
    and Hu, Renfen
    and Li, Wensi
    and Liu, Tao
    and Du, Xiaoyong",
  title =   "Analogical Reasoning on Chinese Morphological and Semantic Relations",
  booktitle =   "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  year =  "2018",
  publisher =   "Association for Computational Linguistics",
  pages =   "138--143",
  location =  "Melbourne, Australia",
  url =   "http://aclweb.org/anthology/P18-2023"
}

Different Domains

The word vectors below were trained with different representations, different context features, and corpora from different domains.

Word2vec / Skip-Gram with Negative Sampling (SGNS)

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d / PWD: 5555 |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d / PWD: z5b4 | 300d | 300d / PWD: yenb |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 (mirrored on Baidu Netdisk / Google Drive) | 300d | 300d | 300d | 300d |
Positive Pointwise Mutual Information (PPMI)

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | Sparse | Sparse | Sparse | Sparse |
| Wikipedia_zh 中文维基百科 | Sparse | Sparse | Sparse | Sparse |
| People's Daily News 人民日报 | Sparse | Sparse | Sparse | Sparse |
| Sogou News 搜狗新闻 | Sparse | Sparse | Sparse | Sparse |
| Financial News 金融新闻 | Sparse | Sparse | Sparse | Sparse |
| Zhihu_QA 知乎问答 | Sparse | Sparse | Sparse | Sparse |
| Weibo 微博 | Sparse | Sparse | Sparse | Sparse |
| Literature 文学作品 | Sparse | Sparse | Sparse | Sparse |
| Complete Library in Four Sections 四库全书* | Sparse | Sparse | NAN | NAN |
| Mixed-large 综合 | Sparse | Sparse | Sparse | Sparse |

*Since the vast majority of words in Classical Chinese are single-character words, only character vectors are needed.


Corpora

Considerable effort went into collecting corpora from multiple domains. All html and xml markup was removed from the raw data, keeping only plain text, after which the texts were segmented with HanLP (v_1.5.3). In addition, traditional Chinese was converted to simplified Chinese with Open Chinese Convert (OpenCC). Detailed corpus information is given below:
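As an illustration of the first preprocessing step, a crude tag stripper might look like the sketch below. This is an assumption for illustration only: the project does not publish its cleaning script, and a real pipeline should use a proper HTML parser rather than a regex.

```python
import re

def strip_markup(text):
    # Crude illustration: drop anything that looks like an html/xml tag, i.e. <...>.
    # Real-world HTML (comments, CDATA, attributes containing '>') needs a parser.
    return re.sub(r'<[^>]+>', '', text)

print(strip_markup('<p>投资<b>风险</b>提示</p>'))  # 投资风险提示
```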

| Corpus | Size | Word count | Vocabulary size | Details |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese encyclopedia, https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia, https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | People's Daily news data (1946-2017), http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data from Sogou Labs, http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese Q&A data, https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Weibo data provided by NLPIR Lab, http://www.nlpir.org/wordpress/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8,599 modern Chinese literary works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | Union of all the corpora above |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of premodern Chinese texts |

In the statistics above, all words are counted, including low-frequency ones.
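The word-count and vocabulary-size columns are straightforward to reproduce from a segmented corpus: count every token occurrence (low-frequency words included) and every distinct token. A minimal sketch:

```python
from collections import Counter

def corpus_stats(tokens):
    """Return (total word count, vocabulary size) for an iterable of
    already-segmented tokens; low-frequency words are included."""
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)

# toy segmented corpus: 5 tokens, 4 distinct words
print(corpus_stats("经济 发展 带动 经济 投资".split()))  # (5, 4)
```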

Loading a Model (Code)

Suppose you have downloaded one or more of the models above; each download is a file ending in .bz2, e.g. sgns.financial.bigram.bz2.

Usage

from gensim.models.keyedvectors import KeyedVectors

# take the financial-news model sgns.financial.bigram.bz2 as an example
model = KeyedVectors.load_word2vec_format('embeddings/sgns.financial.bigram.bz2', 
                                          binary=False, 
                                          unicode_errors='ignore')


model

<gensim.models.keyedvectors.KeyedVectors at 0x7fe7fad79d60>

model.get_vector('投资')

array([-0.084635,  0.890228, -0.23223 , -0.308985,  0.058241,  0.458777,
       -0.152547, -0.413471,  0.269701, -0.078043, -0.4155  ,  0.074735,
        0.35714 ,  0.103431,  0.601784, -0.390854,  0.814801, -0.122664,
       -1.076744,  0.516941, -0.293319, -0.310251, -0.407794,  0.003898,
       -0.210962,  0.378095, -0.345955, -0.223848,  0.700162,  0.207644,
        0.426249, -0.272832, -0.110305, -0.701062, -0.173407, -0.172121,
       -0.682592,  0.593414,  0.279591, -0.408284, -0.166693,  0.753402,
        0.037375,  0.141865, -0.246024, -0.108663, -0.225255, -0.856601,
        0.381026,  0.401248,  0.012108, -0.126305, -0.374255,  0.728795,
        0.219549, -0.354029, -0.353131,  0.064867,  0.49565 , -0.503267,
       -0.304075,  0.145036,  0.688948,  0.063382, -0.223243,  0.474251,
        0.80543 ,  0.683178,  0.118159,  0.408411, -0.020066,  0.009045,
       -0.135446, -0.069633,  0.206357,  0.482845, -0.075307,  0.06433 ,
       -0.112367,  0.011816,  0.87427 , -0.120287, -0.31036 ,  0.369985,
        0.560386, -0.215248,  0.389631,  0.042943, -0.319149,  0.951551,
       -0.335188,  0.642246, -0.55546 ,  0.322397,  0.659618, -0.213124,
        0.346696, -0.342239,  0.31479 ,  0.078533, -0.345148,  0.815577,
       -0.530134,  0.303419, -0.158916, -0.190564,  0.436046, -0.112251,
       -0.339966,  0.253645,  0.181076,  0.122875, -0.310951, -0.126253,
        1.641405,  0.357906,  0.165796,  0.398656, -0.330591,  0.20328 ,
       -0.077191, -0.421248, -0.078504, -0.734519,  0.146212,  0.535727,
        0.014134,  0.040322, -0.44809 , -0.758205, -0.151237,  0.248258,
       -0.319704,  0.656033, -0.518857,  0.932356, -1.01786 , -0.46354 ,
        0.160921, -0.243597,  0.106666, -0.03404 ,  0.010672,  0.260243,
        0.899813,  0.171735, -0.108209, -0.009843, -0.18113 ,  0.302494,
        0.187285,  0.064669, -0.502041, -0.724377, -0.294312, -0.522256,
        0.334543,  0.740455, -0.357653,  0.540747,  0.256146,  0.513839,
        1.116628, -0.626111,  0.505574,  0.089774, -0.381137, -0.282352,
       -0.457542,  0.198909,  0.313638,  0.560809,  0.25295 ,  0.878158,
       -0.289311, -0.629047,  0.011103,  0.041058, -0.291302, -0.014001,
       -0.027697, -0.445817, -0.070086,  0.159816, -0.120071,  1.280489,
       -0.108866,  0.01586 , -0.505574, -0.679772, -0.343165,  0.595633,
        0.438108, -0.364066, -0.393667,  0.442285,  0.24979 , -0.191607,
        0.425692,  0.535577, -0.480332, -0.737461,  0.588498, -0.380264,
        0.151292,  0.077519, -0.221384,  0.699436,  0.401642,  0.509026,
       -0.411141,  0.206719, -0.097051, -0.451834, -0.825617,  0.602984,
        0.2853  ,  0.46055 ,  0.96472 ,  0.322712, -0.373446,  0.207944,
        0.236688,  0.566523,  0.037644,  1.241091,  0.025682,  0.373211,
        0.097712, -0.195355,  0.264579, -0.072992, -0.121629,  0.041688,
        0.213666,  0.329652, -0.015182,  0.396307,  0.117955,  0.119577,
       -0.334761, -0.135917,  0.409983,  0.512367, -0.292204,  0.302897,
       -0.325733,  0.383173, -0.92419 , -0.377535, -0.059801, -0.606275,
       -0.240482,  0.054021, -0.581386, -0.555691,  0.158354,  0.103765,
        0.107681,  0.248877, -0.597925,  0.193332,  0.844085,  0.00584 ,
        0.041622, -0.111235,  0.617778,  0.234883, -0.09562 ,  0.408324,
       -0.107121,  0.717875,  0.674794,  0.127214, -0.178357,  0.331436,
        0.417898, -0.650833, -0.428309, -0.576132,  0.210533, -0.057879,
       -0.578397,  0.468586,  0.103365, -0.403216, -0.398776,  0.094514,
       -0.130387,  0.628187, -0.463082, -0.951649,  0.561544,  0.118903,
        0.448327, -0.171685, -0.672348,  0.069471,  0.556452, -0.335425],
      dtype=float32)
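The nearest-neighbor queries below rank the whole vocabulary by cosine similarity against vectors like the one above. A minimal sketch of the metric over plain Python lists (gensim computes the same thing with normalized numpy matrices for speed):

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors -> 1.0
```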

model.similar_by_key('投资')

[('长期投资', 0.5135656595230103),
 ('投资规模', 0.5089880228042603),
 ('智百扬', 0.49565914273262024),
 ('投资总额', 0.4955061078071594),
 ('洛辉', 0.489188551902771),
 ('337409', 0.48917514085769653),
 ('洛盛', 0.4819018244743347),
 ('洛腾', 0.4728960692882538),
 ('394150', 0.4704836308956146),
 ('投资额', 0.4685181975364685)]

model.similar_by_key('风险')

[('提示', 0.6549968123435974),
 ('经营风险', 0.6316577792167664),
 ('景气衰退', 0.544153094291687),
 ('风险分析', 0.5439289212226868),
 ('遇宏观', 0.5435716509819031),
 ('信用风险', 0.5345730185508728),
 ('承受能力', 0.5291797518730164),
 ('防范', 0.5271924138069153),
 ('系统性', 0.5178108811378479),
 ('不确定性', 0.5173759460449219)]
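Nothing about these files actually requires gensim: each .bz2 archive holds the plain word2vec text format, i.e. a header line "<vocab_size> <dim>" followed by one "word v1 ... v_dim" line per word. A minimal reader is sketched below (gensim is more robust about encodings, binary files, and malformed lines):

```python
import bz2

def load_word2vec_text(path, limit=None):
    """Read the word2vec text format into a dict of word -> list[float].
    limit caps the number of vectors read (the full files are large)."""
    vectors = {}
    opener = bz2.open if path.endswith('.bz2') else open
    with opener(path, 'rt', encoding='utf-8', errors='ignore') as f:
        vocab_size, dim = map(int, f.readline().split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(' ')
            if len(parts) == dim + 1:  # skip malformed lines
                vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

For example, `load_word2vec_text('embeddings/sgns.financial.bigram.bz2', limit=50000)` loads the 50,000 most frequent words, which you can then rank by cosine similarity yourself.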

The vectors perform reasonably well. If you are interested, you can also train a word2vec model on your own data; for training and usage, see the article:

豆瓣影评 | 探索词向量妙处 (Douban movie reviews: exploring what word vectors can do)

