1. Introduction

This is a word-embedding model of the Chinese A-share market, trained on 2001–2021 management discussion and analysis (MD&A) data (1.45 GB of text), with a vocabulary of 789,539 words and a model file of 650 MB. It can be widely used to build or expand concept (sentiment) dictionaries in economics, management, and related fields.

The training environment was a Windows server with 256 GB of RAM (most everyday office PCs have 16 GB), running version 2.0.0 of the cntext library (that version is not yet open source; the latest obtainable release is 1.8.4). On the same machine I also tried training on 14 GB of annual-report data: after two days of running it produced no result and had used up nearly all 256 GB of memory. So for model training with cntext, a corpus of around 1 GB is the suitable scale.

Model files

  • mda01-21.200.6.bin
  • mda01-21.200.6.bin.vectors.npy

Parameter breakdown

  • mda01-21: trained on MD&A data for the years 2001–2021
  • 200: the embedding dimension, i.e. each word vector has length 200
  • 6: the context window size is 6

For why 200 and 6 were chosen, see the post 词嵌入技术在社会科学领域进行数据挖掘常见39个FAQ汇总 (a collection of 39 FAQs on using word embeddings for data mining in the social sciences).



2. Loading the Model

Two helper functions are needed, load_w2v and expand_dictionary. Their source code is long, so for readability it is placed at the end of this article. Remember to define both functions before using them.

# First run the source code of the load_w2v and expand_dictionary functions


# Load the model file
wv = load_w2v(w2v_path='model/mda01-21.200.6.bin')
wv
Loading word2vec model...
<gensim.models.keyedvectors.KeyedVectors at 0x7fcc91900a90>

3. Using wv

  • Check the vocabulary size
  • Look up the vector of a single word
  • Get the mean vector of several words

For more functionality, see the gensim documentation.

# Vocabulary size
len(wv.index_to_key)

789539

# Look up the vector of a single word
wv.get_vector('创新')

array([ 4.34389877e+00, -4.93447453e-01,  2.17293240e-02,  1.90846980e+00,
        8.75901580e-01, -7.95542181e-01, -1.12950909e+00,  7.44228721e-01,
        7.38122821e-01,  6.42377853e-01,  3.99175316e-01,  2.17924786e+00,
        9.30410504e-01, -3.23538423e+00, -2.91860670e-01,  1.04046893e+00,
       -1.73857129e+00, -1.12141383e+00,  3.51870751e+00, -8.69141936e-01,
        4.95228887e-01,  4.80194688e-01, -3.35257435e+00,  7.16054797e-01,
        2.29016230e-01,  2.40962386e+00, -7.40825295e-01,  2.18998361e+00,
       -3.37587762e+00, -1.30376315e+00,  5.08445930e+00, -1.68504322e+00,
       -1.60081315e+00, -8.33779454e-01, -7.58818448e-01, -1.78838921e+00,
        2.44672084e+00,  2.27579999e+00, -2.52457595e+00,  1.36214256e-01,
       -3.09675723e-01, -6.98232710e-01,  1.73018420e+00, -8.05342972e-01,
       -1.70148358e-01, -2.43612671e+00, -1.23085886e-01,  2.83124876e+00,
        3.89446110e-01, -3.16048344e-03, -2.09607935e+00, -1.49788404e+00,
        8.58029604e-01, -1.26923633e+00,  1.86084434e-01,  9.13471103e-01,
        1.53111053e+00, -2.57916182e-01,  1.83742964e+00,  1.50475979e+00,
        6.84375539e-02,  2.76320624e+00,  1.02619076e+00,  9.41017449e-01,
        1.66149962e+00, -2.49254084e+00, -7.78038025e-01, -6.52620196e-01,
       -1.59455287e+00, -4.13568115e+00,  2.78383470e+00, -5.71591198e-01,
       -8.45031738e-01,  4.54110718e+00,  1.67990357e-01,  2.12474012e+00,
       -2.25404716e+00, -8.35567772e-01,  9.91619170e-01, -2.55307484e+00,
        2.39850569e+00,  7.65280128e-01,  2.64600372e+00,  2.58998632e-01,
       -6.56729996e-01, -1.55601549e+00,  1.49751770e+00,  8.47311080e-01,
       -2.05665565e+00, -1.14815748e+00,  1.97350585e+00,  1.02964830e+00,
       -3.87644440e-01, -9.38048363e-01, -2.55545706e-01, -7.02206418e-03,
       -2.94358826e+00, -7.96167493e-01,  1.59571424e-01,  1.25497723e+00,
        7.12080002e-01, -1.34656525e+00,  1.54059935e+00, -1.12930894e+00,
       -3.66737366e+00, -7.17270374e-01, -2.69604278e+00,  1.90242791e+00,
        9.33268607e-01, -4.67624277e-01,  3.51641893e+00,  5.66355512e-02,
       -1.31763351e+00,  1.53379011e+00,  2.32190108e+00, -5.21186776e-02,
        4.06406015e-01,  4.48809415e-01, -3.68958092e+00, -1.01650321e+00,
       -1.08470261e+00, -1.93710685e+00,  2.27287245e+00, -6.63952589e-01,
        1.88207674e+00, -1.20226216e+00,  1.08953261e+00,  1.32847381e+00,
        1.38213491e+00,  1.47196710e+00, -2.06643629e+00,  1.99588931e+00,
       -1.64155555e+00, -2.24964902e-01, -2.74115324e+00, -3.16747665e+00,
        1.24095821e+00, -4.10616726e-01, -3.48466903e-01,  1.38452172e+00,
       -1.45676279e+00, -3.54911834e-02, -4.73554075e-01, -4.23114252e+00,
        1.52749741e+00,  7.25808144e-01, -4.50003862e-01, -3.16014004e+00,
        2.60309219e+00, -2.11320925e+00,  3.61347020e-01,  1.73625088e+00,
        1.57609022e+00, -2.08762145e+00,  2.18810892e+00,  1.20706499e+00,
       -1.82370770e+00,  1.22358835e+00, -8.91464829e-01, -3.30527711e+00,
       -3.72515142e-01, -6.23329699e-01,  8.11975658e-01, -8.52464736e-01,
       -9.35325995e-02, -4.06904364e+00,  1.57146180e+00,  7.85030201e-02,
        1.94540334e+00,  2.13809991e+00, -1.58913553e+00, -3.81727874e-01,
       -2.08527303e+00,  5.89691937e-01,  2.55564898e-01,  2.38364622e-01,
        3.64680409e+00,  4.18930590e-01,  1.62034535e+00, -4.63252217e-02,
        5.80206394e-01,  5.55441022e-01,  1.91946900e+00, -1.89253080e+00,
        1.77489519e+00, -3.15311766e+00,  6.48138940e-01,  1.15823770e+00,
       -2.54519200e+00, -1.03516951e-01,  1.15724599e+00, -1.83681571e+00,
       -9.87860620e-01, -1.99984312e+00,  2.76547909e-01,  8.02748859e-01,
        1.99196494e+00, -1.43310416e+00, -2.03039408e+00, -7.19777197e-02],
      dtype=float32)

# Mean vector of several words
wv.get_mean_vector(['创新', '研发'])

array([ 0.17623448, -0.02220692, -0.01040847,  0.03616136,  0.04931263,
       -0.06220303, -0.02846557, -0.00156435,  0.04524047,  0.03185674,
        0.01104859,  0.06962118, -0.01969986, -0.10831943, -0.0524368 ,
        0.00623383, -0.04149605, -0.004912  ,  0.13154642, -0.04317038,
       -0.00407438,  0.00923527, -0.13339072,  0.01446994, -0.00153984,
        0.12378754, -0.06064663,  0.09322313, -0.07711462, -0.05880795,
        0.13697049,  0.0133168 ,  0.02769322,  0.02677607,  0.02549294,
       -0.04504526,  0.06267191,  0.02421109, -0.13401456,  0.01423616,
        0.01860182,  0.00344108,  0.04811918,  0.02748652,  0.0190251 ,
       -0.03800797,  0.01517046,  0.06439836,  0.01320594,  0.04748138,
       -0.08914943, -0.00642068,  0.01786153, -0.02472607, -0.04597819,
        0.05832303,  0.11275461, -0.0387079 ,  0.06912261,  0.05287468,
       -0.04447906,  0.10994074, -0.04371417,  0.01227543,  0.07498093,
       -0.11285575, -0.03113984, -0.01122221, -0.03913497, -0.12117577,
        0.08593786, -0.04319173, -0.01860389,  0.15636683,  0.02267851,
        0.0922839 , -0.12106322, -0.07572737,  0.0191772 , -0.00977821,
        0.00455545,  0.01378978,  0.04774487, -0.02080727,  0.01015578,
       -0.04695337,  0.0848957 , -0.01112909, -0.03210922,  0.01151857,
        0.02214565,  0.03220333, -0.02468888, -0.07493623, -0.03724978,
       -0.00716823, -0.12043905, -0.0560291 , -0.00666756,  0.03659805,
        0.0532646 , -0.05371486,  0.06905847,  0.00660356, -0.10362111,
       -0.0015829 , -0.13282564,  0.08241726,  0.00993685,  0.04208402,
        0.03087696,  0.04765649, -0.00834742,  0.07236902,  0.04473683,
       -0.02643864, -0.0050621 ,  0.04462356, -0.0832998 , -0.05533891,
        0.00664944, -0.13001585,  0.07607447, -0.00764748,  0.01410657,
       -0.03057465,  0.0250505 ,  0.09252612, -0.00784517,  0.0386237 ,
       -0.059011  ,  0.05357389, -0.04604931,  0.04388874, -0.0971131 ,
       -0.09777305,  0.02943253, -0.04103448, -0.03944859,  0.09638489,
       -0.02226706,  0.02822194, -0.0093646 , -0.11203568,  0.06142627,
        0.04761236,  0.02720375, -0.09777595,  0.04048391, -0.06758034,
       -0.01500905,  0.02439078,  0.07150253, -0.02562411,  0.02533657,
        0.00799897, -0.06416934,  0.03153701, -0.03944302, -0.04653639,
       -0.04123383, -0.01590026,  0.03051148, -0.02014856, -0.01448381,
       -0.10517117, -0.00649814,  0.02478252,  0.02855514,  0.09052269,
       -0.03505059, -0.03173327, -0.06641324,  0.06284194,  0.01993516,
        0.01349441,  0.1410133 , -0.05283241,  0.03687092, -0.02535007,
        0.00415636,  0.05841105,  0.07389537, -0.13176979,  0.06759793,
       -0.092868  ,  0.01370211,  0.06616284, -0.09137756, -0.01640504,
        0.06095972, -0.05725639, -0.04122292,  0.00598698,  0.02904861,
        0.0442962 ,  0.07399555, -0.04657119, -0.07636161,  0.03204561],
      dtype=float32)
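Note the scale difference between the two outputs above: components of the raw 创新 vector are on the order of ±4, while the mean vector's are around ±0.1. This is because gensim's get_mean_vector unit-normalizes each vector before averaging (its pre_normalize argument defaults to True). A small numpy reproduction of that behavior:

```python
import numpy as np

def mean_vector(vectors):
    # Mimics KeyedVectors.get_mean_vector with pre_normalize=True:
    # each vector is scaled to unit length before averaging.
    normed = [v / np.linalg.norm(v) for v in vectors]
    return np.mean(normed, axis=0)

v1 = np.array([3.0, 4.0])   # norm 5 -> [0.6, 0.8]
v2 = np.array([0.0, 5.0])   # norm 5 -> [0.0, 1.0]
print(mean_vector([v1, v2]))  # -> [0.3 0.9]
```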

Once every word or concept has a vector, these can be combined with the attitude/bias measures built into the single-language models of earlier cntext releases.
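cntext's own bias functions are not reproduced here; to illustrate the general idea, a WEAT-style association score checks whether a concept vector sits closer to one attribute word set than to another (the vectors below are toy values, not taken from the model):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def attitude_score(concept, positive_attrs, negative_attrs):
    # Mean cosine similarity to the positive attribute set minus the mean
    # similarity to the negative set; > 0 means a more positive association.
    pos = np.mean([cosine(concept, v) for v in positive_attrs])
    neg = np.mean([cosine(concept, v) for v in negative_attrs])
    return pos - neg

# Toy 2-d vectors: the concept leans toward the positive attribute.
concept = np.array([1.0, 0.2])
score = attitude_score(concept, [np.array([1.0, 0.0])], [np.array([0.0, 1.0])])
print(round(score, 3))  # -> 0.784
```

In practice the attribute vectors would come from `wv.get_vector(...)` for hand-picked positive and negative attribute words.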



4. Expanding a Dictionary

For dictionary-based text analysis, the most important ingredient is a dictionary of your own domain. Constrained by the technical barrier, I, trained in the humanities, long made do with general-purpose sentiment dictionaries of adjectives. With word2vec, both the accuracy and the efficiency of manual dictionary construction can now be improved.

Below are dictionary-expansion tests run on mda01-21.200.6.bin. Given a set of seed words, the expand_dictionary function selects the topn most similar words.

# Experiment: short-termism words
expand_dictionary(wv=wv, 
                  seedwords=['抓紧', '立刻', '月底', '年底', '年终', '争取', '力争'],
                  topn=30)

['抓紧',
 '立刻',
 '月底',
 '年底',
 '年终',
 '争取',
 '力争',
 '争取',
 '力争',
 '年内',
 '月底',
 '年底',
 '尽早',
 '3月底',
 '尽快',
 '抓紧',
 '6月份',
 '4月份',
 '月份',
 '工作力争',
 '努力争取',
 '工作争取',
 '10月底',
 '年内实现',
 '年底完成',
 '中旬',
 '7月份',
 '9月底',
 '有望',
 '月底前',
 '早日',
 '全力',
 '继续',
 '月初',
 '努力',
 '确保',
 '8月份']

expand_dictionary(wv=wv, 
                  seedwords=['团结', '拼搏',  '克服',  '勇攀高峰',  '友善',  '进取'],
                  topn=30)

['团结',
 '拼搏',
 '克服',
 '勇攀高峰',
 '友善',
 '进取',
 '拼搏',
 '艰苦奋斗',
 '坚定信念',
 '团结拼搏',
 '上下同心',
 '团结',
 '顽强拼搏',
 '勇于担当',
 '团结一致',
 '团结奋进',
 '精诚团结',
 '齐心协力',
 '开拓进取',
 '奋进',
 '团结一心',
 '实干',
 '同心协力',
 '团结协作',
 '锐意进取',
 '积极进取',
 '奋力拼搏',
 '拼搏精神',
 '努力拼搏',
 '进取',
 '奋发有为',
 '扎实工作',
 '同心同德',
 '拼搏进取',
 '脚踏实地',
 '励精图治']

expand_dictionary(wv=wv, 
                  seedwords=['创新', '科技',  '研发',  '技术',  '标准'],
                  topn=30)

['创新',
 '科技',
 '研发',
 '技术',
 '标准',
 '创新',
 '技术创新',
 '技术研发',
 '科技创新',
 '先进技术',
 '自主创新',
 '前沿技术',
 '关键技术',
 '科研',
 '新技术',
 '创新性',
 '研发创新',
 '产品研发',
 '基础研究',
 '产品开发',
 '集成创新',
 '核心技术',
 '自主研发',
 '技术应用',
 '技术集成',
 '前沿科技',
 '技术标准',
 '工艺技术',
 '科技成果',
 '技术开发',
 '尖端技术',
 '工程技术',
 '技术相结合',
 '科学技术',
 '工艺']

expand_dictionary(wv=wv, 
                  seedwords=['竞争', '竞争力'],
                  topn=30)

['竞争',
 '竞争力',
 '竞争能力',
 '竞争优势',
 '市场竞争',
 '竞',
 '市场竞争力',
 '竞争实力',
 '参与市场竞争',
 '国际竞争',
 '市场竞争能力',
 '核心竞争力',
 '激烈竞争',
 '市场竞争优势',
 '竞争态势',
 '参与竞争',
 '竞争力重要',
 '竞争对手',
 '创新能力',
 '综合竞争力',
 '价格竞争',
 '之间竞争',
 '核心竞争能力',
 '未来市场竞争',
 '国际竞争力',
 '影响力竞争力',
 '国际化竞争',
 '行业竞争',
 '综合竞争能力',
 '竞争日趋激烈',
 '产品竞争力',
 '竞争力影响力']

expand_dictionary(wv=wv, 
                  seedwords=['疫情', '扩散', '防控', '反复', '冲击'],
                  topn=30)

['疫情',
 '扩散',
 '防控',
 '反复',
 '冲击',
 '蔓延',
 '疫情冲击',
 '疫情爆发',
 '新冠疫情',
 '新冠肺炎',
 '疫情蔓延',
 '疫情暴发',
 '肆虐',
 '本次疫情',
 '冲击疫情',
 '新冠病毒',
 '疫情扩散',
 '全球蔓延',
 '疫情影响',
 '病毒疫情',
 '肺炎疫情',
 '击',
 '持续蔓延',
 '疫情持续',
 '各地疫情',
 '疫情突然',
 '疫情全球',
 '疫情传播',
 '疫情反复',
 '散发',
 '变异毒株',
 '疫情导致',
 '疫情肆虐',
 '全球疫情',
 '全球新冠']

expand_dictionary(wv=wv, 
                  seedwords=['旧', '老', '后', '落后'],
                  topn=30)

['旧',
 '老',
 '后',
 '落后',
 '旧',
 '老',
 '陈旧',
 '老旧',
 '淘汰',
 '高能耗',
 '低效率',
 '设备陈旧',
 '能耗高',
 '老旧设备',
 '落后工艺',
 '进行改造',
 '工艺落后',
 '技术落后',
 '翻新',
 '更新改造',
 '改造',
 '更新',
 '替换',
 '改造更新',
 '旧设备',
 '污染重',
 '淘汰一批',
 '拆除',
 '污染严重',
 '简陋',
 '产能落后',
 '相对落后',
 '产能淘汰',
 '效率低下']



5. Source Code

from gensim.models import KeyedVectors
from pathlib import Path


def load_w2v(w2v_path):
    """
    Load word2vec model

    Args:
        w2v_path (str): path of word2vec model

    Returns:
        model: word2vec model
    """
    print('Loading word2vec model...')
    model = KeyedVectors.load(w2v_path)
    return model


def expand_dictionary(wv, seedwords, topn=100):
    """
    Given a list of seed words, return them together with the topn words
    whose semantics are most similar to the seed set.

    Args:
        wv (KeyedVectors): the word embedding model
        seedwords (list): seed words
        topn (int, optional): number of most similar words to retrieve. Defaults to 100.

    Returns:
        list: the seed words followed by the topn most similar words
    """
    simidx_scores = []

    similars_candidate_idxs = []  # candidate word indices for all seed words
    dictionary = wv.key_to_index
    seedidxs = []  # seed words converted to vocabulary indices
    for seed in seedwords:
        if seed in dictionary:
            seedidx = dictionary[seed]
            seedidxs.append(seedidx)
    for seedidx in seedidxs:
        # sims_words such as [('by', 0.99984), ('or', 0.99982), ('an', 0.99981), ('up', 0.99980)]
        sims_words = wv.similar_by_word(seedidx, topn=topn)
        #Convert words to index and store them
        similars_candidate_idxs.extend([dictionary[sim[0]] for sim in sims_words])
    similars_candidate_idxs = set(similars_candidate_idxs)
    
    for idx in similars_candidate_idxs:
        score = wv.n_similarity([idx], seedidxs)
        simidx_scores.append((idx, score))
    simidxs = [w[0] for w in sorted(simidx_scores, key=lambda k:k[1], reverse=True)]

    simwords = [str(wv.index_to_key[idx]) for idx in simidxs][:topn]

    resultwords = []
    resultwords.extend(seedwords)
    resultwords.extend(simwords)
    
    return resultwords
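As the result lists above show, the seed words come back twice: expand_dictionary prepends them, and they also rank at the top of the similarity search. If a deduplicated dictionary is needed, an order-preserving dedup is one line (a trivial helper, not part of cntext):

```python
def dedupe(words):
    # dict.fromkeys keeps only the first occurrence of each word, in order.
    return list(dict.fromkeys(words))

print(dedupe(['抓紧', '立刻', '争取', '抓紧', '争取']))  # -> ['抓紧', '立刻', '争取']
```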



Getting the Model

Training this model was not easy, so it is offered as a paid resource; if you would like to use it, follow the purchase link.



Looking for Collaboration

cntext is still iterating; version 2.0.0 combines language-model training with cross-language model alignment and has considerable application value. I look forward to collaborations involving distinctive text datasets.

With cntext 2.0.0, it is in principle possible to compute measures for the social actors a text concerns, which suits studies of corporate culture, brand image, tourist-destination image, national image, and the like:

  • for the same actor across different periods, shifts in the cultural attitudes and perceptions carried by the text;
  • or, within the same period, differences in large text samples across actors.
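The cross-model alignment in cntext 2.0.0 is likewise closed source, so the code below is not cntext's method; it only sketches a standard technique for the problem. Orthogonal Procrustes alignment makes two independently trained embedding spaces (e.g. two time periods) comparable, illustrated here with toy matrices:

```python
import numpy as np

def procrustes_align(A, B):
    # Orthogonal Procrustes: the orthogonal matrix R minimizing ||A @ R - B||
    # is U @ Vt, where U, S, Vt = svd(A.T @ B). Applying R maps vectors from
    # model A's space into model B's space without distorting distances.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 3))            # 6 shared "anchor words", 3-d toy vectors
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
A = B @ Q.T                            # model A: the same vectors, rotated
R = procrustes_align(A, B)
print(np.allclose(A @ R, B))           # -> True
```

In a real study A and B would be the vectors of words shared by two trained KeyedVectors models, stacked row by row.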


