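A few days ago I shared "LIWC vs Python | A Brief Introduction to the Dictionary Word-Frequency Method for Text Analysis (with Code)". Taking a cue from LIWC, I think Chinese also needs a library of sentiment dictionaries for social science research. If the Chinese sentiment dictionaries from published papers could be pooled, contributed by users in the spirit of user-generated content (UGC), Chinese text analysis would become much easier. The figure below shows LIWC's page of user-shared dictionaries.
LIWC user-shared dictionaries
Without a purchased copy of LIWC you cannot see the "USER-CREATED LIWC DICTIONARIES" section shown in the screenshot. For copyright reasons I am not sharing the English dictionary files; let's respect intellectual property together.
Many sentiment dictionaries for specific research fields have been published for Chinese. If you have a dictionary to recommend, please email me at thunderhit@qq.com and I can convert it into cntext's built-in format.
Suppose cntext's built-in dictionaries were richer; the examples below show the kind of text analysis you could then do with cntext.
Example: using cntext
Built-in cntext dictionaries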
import cntext as ct
# cntext version
print('cntext version: {}'.format(ct.__version__))
# list cntext's built-in dictionaries
ct.dict_pkl_list()
Run
cntext version: 1.7.1
['DUTIR.pkl',
'HOWNET.pkl',
'sentiws.pkl',
'ChineseFinancialFormalUnformalSentiment.pkl',
'ANEW.pkl',
'LSD2015.pkl',
'NRC.pkl',
'geninqposneg.pkl',
'HuLiu.pkl',
'AFINN.pkl',
'ADV_CONJ.pkl',
'LoughranMcDonald.pkl',
'STOPWORDS.pkl',
'concreteness.pkl']
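Loading a built-in pkl dictionary
The built-in cntext dictionaries are being standardized. Ideally, a standardized dictionary contains three parts: the word lists, a Desc description, and a Referer reference. Take DUTIR.pkl, the Dalian University of Technology affective lexicon ontology, as an example: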
dutir = ct.load_pkl_dict('DUTIR.pkl')
dutir
Run
{'DUTIR': {'哀': ['怀想', '治丝而棼', '伤害',...],
'好': ['进贤黜奸', '清醇', '放达', ...],
'惊': ['惊奇不已', '魂惊魄惕', '海外奇谈',...],
'惧': ['忸忸怩怩', '谈虎色变', '手忙脚乱',...],
'乐': ['百龄眉寿', '娱心', '如意',...],
'怒': ['饮恨吞声', '扬眉瞬目',...],
'恶': ['出逃', '鱼肉百姓', '移天易日',...]},
'Desc': '大连理工大学情感本体库,细粒度情感词典。含七大类情绪,依次是哀, 好, 惊, 惧, 乐, 怒, 恶',
'Referer': '徐琳宏,林鸿飞,潘宇,等.情感词汇本体的构造[J]. 情报学报, 2008, 27(2): 180-185.'}
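dutir contains three parts:
- the dictionary data (under the DUTIR key)
- Desc: a short description of the dictionary
- Referer: the publication the dictionary comes from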
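As a minimal sketch of that three-part layout, the snippet below builds a small dictionary in the same shape, checks that the three parts are present, and round-trips it through Python's standard pickle module. The key `MY_DICT`, the helper `check_dict_format`, the sample words, and the placeholder Desc/Referer strings are all hypothetical. How cntext itself stores or registers its .pkl files is not shown in this post, so treat this only as an illustration of the data structure.

```python
import pickle


def check_dict_format(obj):
    """Return the word-list key if obj follows the three-part layout.

    Hypothetical helper, not part of cntext: it expects exactly one
    word-list entry plus the 'Desc' and 'Referer' strings.
    """
    meta_keys = {'Desc', 'Referer'}
    missing = meta_keys - set(obj)
    if missing:
        raise ValueError('missing keys: {}'.format(sorted(missing)))
    data_keys = [k for k in obj if k not in meta_keys]
    if len(data_keys) != 1:
        raise ValueError('expected exactly one word-list key')
    for category, words in obj[data_keys[0]].items():
        if not isinstance(words, list):
            raise ValueError('category {!r} must map to a list'.format(category))
    return data_keys[0]


# Illustrative dictionary; the words and metadata are placeholders.
my_dict = {
    'MY_DICT': {'pos': ['高兴', '快乐'], 'neg': ['难过', '悲伤']},
    'Desc': 'Toy two-category sentiment dictionary (placeholder).',
    'Referer': 'Placeholder reference.',
}

# Round-trip through pickle in memory, as a stand-in for a .pkl file.
restored = pickle.loads(pickle.dumps(my_dict))
name = check_dict_format(restored)
print(name, list(restored[name].keys()))
```

A dictionary that passes this kind of check carries its own description and citation, which is the point of the standardized format.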
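Sentiment scoring with cntext
Sentiment analysis here means counting how many words of each category appear in a text; the cntext.sentiment function does exactly that.
sentiment(text, diction, lang='chinese')
- text: the text string
- diction: the sentiment dictionary
- lang: language, "chinese" or "english"; defaults to lang="chinese"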
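To make the counting idea concrete, here is a minimal pure-Python sketch of the dictionary word-frequency method. It is not cntext's implementation: the function `count_categories` is hypothetical, it takes text that is already tokenized (the token list below was split by hand for illustration), and it leaves out the stopword and sentence counts.

```python
def count_categories(tokens, diction):
    """Count how many tokens fall into each dictionary category."""
    result = {}
    for category, words in diction.items():
        vocab = set(words)  # set lookup keeps membership tests fast
        result[category + '_num'] = sum(1 for tok in tokens if tok in vocab)
    result['word_num'] = len(tokens)
    return result


# Toy dictionary and a hand-tokenized sentence (both are assumptions).
toy_dict = {'pos': ['高兴', '快乐'],
            'neg': ['难过', '悲伤'],
            'adv': ['很', '特别']}
tokens = ['今天', '很', '高兴', '但是', '也', '有点', '难过']

toy_counts = count_categories(tokens, toy_dict)
print(toy_counts)
```

For real Chinese text the tokens would come from a word segmenter rather than a hand-written list.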
import cntext as ct
# custom dictionary
diy_dict = {'pos': ['高兴', '快乐', '分享'],
            'neg': ['难过', '悲伤'],
            'adv': ['很', '特别']}
# cntext built-in dictionary: DUTIR
dutir = ct.load_pkl_dict('DUTIR.pkl')['DUTIR']
text = '我今天得奖了,很高兴,我要将快乐分享大家。'
# sentiment analysis with diy_dict
print(ct.sentiment(text=text,
                   diction=diy_dict,
                   lang='chinese'))
# sentiment analysis with DUTIR
print(ct.sentiment(text=text,
                   diction=dutir,
                   lang='chinese'))
Run
{'pos_num': 3,
'neg_num': 0,
'adv_num': 1,
'stopword_num': 8,
'word_num': 14,
'sentence_num': 1}
{'哀_num': 0,
'好_num': 0,
'惊_num': 0,
'惧_num': 0,
'乐_num': 2,
'怒_num': 0,
'恶_num': 0,
'stopword_num': 8,
'word_num': 14,
'sentence_num': 1}
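Raw counts depend on text length, so a common follow-up step is to turn them into proportions. The sketch below does this for the diy_dict result above. The helper `to_ratios` and the choice of word_num as the denominator are assumptions for illustration, not something cntext prescribes.

```python
def to_ratios(counts, total_key='word_num'):
    """Convert '<category>_num' counts into shares of the total word count."""
    skip = {total_key, 'stopword_num', 'sentence_num'}
    total = counts[total_key]
    ratios = {}
    for key, value in counts.items():
        if key in skip:
            continue
        name = key[:-len('_num')] + '_ratio'
        # Guard against empty texts, where word_num is 0.
        ratios[name] = value / total if total else 0.0
    return ratios


# Counts copied from the diy_dict output above.
counts = {'pos_num': 3, 'neg_num': 0, 'adv_num': 1,
          'stopword_num': 8, 'word_num': 14, 'sentence_num': 1}

ratios = to_ratios(counts)
print(ratios)
```

Proportions like these make texts of different lengths comparable, which matters when the same dictionary is applied across a whole corpus.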
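LIWC user-shared dictionaries
The following is compiled from the LIWC website; I added the DOIs and the Chinese translations. Because I have not read the paper behind each dictionary, the translated descriptions may contain errors.
The dictionaries below are listed for reference only; if anything is unclear, follow the DOI to the corresponding paper.
For copyright reasons, the dictionary files are not shared.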
Dictionary | Desc | Author | Date | DOI |
---|---|---|---|---|
Absolutist | Measure absolutist thinking in texts (eg, always, never)衡量文本中的绝对主义思维(例如,always、never) | Al-Mosaiwi & Johnstone | 2018 | https://doi.org/10.1177/2167702617747074 |
Age_Stereotypes | Reflects eight broadly-defined stereotypes identified in past research as descriptive of older adults, such as impaired, despondent, shrew, recluse, vulnerable, golden, grandparent, conservative 反映过去研究中确定的八种广泛定义的刻板印象(用于描述老年人),例如“受损、沮丧、泼妇、隐士、脆弱、黄金、祖父母、保守” | Jessica Remedios | 2010 | https://doi.org/10.1080/15298860903054175 |
Agitation&Dejection | Based on studies linking promotion versus prevention focus with the emotions “Agitation” and “Dejection” 基于将促进与预防重点与情绪“激动”和“沮丧”联系起来的研究 | Johnsen et al. | 2014 | https://doi.org/10.2147/PRBM.S54947 |
Behavioral_Activation | Captures linguistic indicators of planning and participation in enjoyable activities 捕捉规划和参与愉快活动的语言指标 | Burkhardt et al. | 2021 | https://doi.org/10.2196/28244 |
Big_Two | Measure the degree to which a person is thinking in terms of Agency/Communion. 衡量一个人在机构/交流方面的思考程度。 | Pietraszkiewicz et al. | 2019 | https://doi.org/10.1002/ejsp.2561 |
Brand_Personality | Assesses Aaker’s five brand personality dimensions as well as 42 personality trait norms 评估 Aaker 的五个品牌个性维度以及 42 个个性特征规范 | Opoku et al. | 2008 | https://doi.org/10.1080/08841240802100386 |
Controversial_Terms | A lexicon of terms that range in their degree of controversiality, particularly in terms of their use in the media. 具有争议程度的术语词典,特别是在媒体中的使用方面。 | Mejova et al. | 2014 | http://arxiv.org/abs/1409.8152 |
Corporate_Social_Responsibility | Reveals four dimensions of corporate social responsibility 揭示企业社会责任的四个维度 | Nadra Pencle & Irina Mălăescu | 2016 | https://doi.org/10.2308/jeta-51615 |
Cost_Benefit | Measures language related to perceived costs and benefits that result from a decision or behavior. 衡量与决策或行为导致的感知成本和收益相关的语言。 | Michael McCullough | 2006 | https://doi.org/10.1037/0022-006X.74.5.887 |
Creativity&Innovation | Language describing creation and/or innovation 描述创造和/或创新的语言 | Neufeld and Gaucher | 2017 | |
Crovitz_Innovator_Identification | Identify “innovators” and “non-innovators” using Herbert F. Crovitz’s 42 relational words 使用 Herbert F. Crovitz 的 42 个相关词识别“创新者”和“非创新者” | Greco et al. | 2021 | https://doi.org/10.1007/s11135-020-01038-x |
extended_Moral_Foundations_Dictionary(eMFD) | The eMFD, unlike previous methods, is constructed from text annotations generated by a large sample of human coders. 与以前的方法不同,eMFD 是由大量人类编码人员生成的文本注释构成的。 | Hopp et al. | 2021 | https://doi.org/10.3758/s13428-020-01433-0 |
Foresight | Measures the degree to which anticipation/foresight occurs. That is, words pointing to indicate where things are heading (often on the basis of recurrent behaviors). 衡量预期/预见发生的程度。 也就是说,指向事物前进方向的词语(通常基于反复出现的行为)。 | Robert Hogenraad | 2020 | https://doi.org/10.1007/s11135-020-01071-w |
Imagination | Digital lexicon of 627 entries relative to imagination and transfiguration, i.e., words pointing to the unbelievable and whatever is beyond the real. 与想象和变形相关的 627 个条目的数字词典,即指向令人难以置信的事物和超越真实事物的词语。 | Robert Hogenraad | 2019 | https://doi.org/10.1007/s11135-018-0813-7 |
Global_Citizen | A dictionary to assess language usage related to global citizenship 用于评估与全球公民相关的语言使用情况的词典 | Stephen Reysen et al. | 2014 | https://doi.org/10.4018/ijcbpl.2014100101 |
Grant_Evaluation | Captures categories relevant to scientific grant review (ability, achievement, agentic, research, standout, pos eval, neg eval) 捕获与科学资助审查相关的类别(能力、成就、代理、研究、杰出、正面、负面) | Kaatz et al. | 2015 | https://doi.org/10.1097/ACM.0000000000000442 |
Home_Perceptions | Calculates the frequency of words describing clutter, a sense of the home as unfinished, restful words, and nature words 计算描述杂乱、未完成的家感、宁静的词和自然词的频率 | Saxbe & Repetti | 2022-01-01 | https://doi.org/10.1177/0146167209352864 |
Invective Dictionary | Use this dictionary to detect invective language in narrative 使用这本词典检测叙事中的谩骂语言 | A. T. Panter | 2022-01-01 | |
Linguistic_Category_Model | A computerized LCM analysis method | Yi-Tai Seih | 2017 | https://doi.org/10.1177/0261927X16657855 |
Loughran_McDonald_Financial_Sentiment | Dictionary for measuring positive and negative sentiment specifically in financial texts. This is the 2018 version of the dictionary. 专门用于衡量金融文本中正面和负面情绪的字典。这是 2018 年版的字典。 | Loughran & McDonald | 2011 | https://doi.org/10.1111/j.1540-6261.2010.01625.x |
Masculine_and_Feminine | List of masculine and feminine words from Gaucher et al. (2011) Gaucher 等人的男性化和女性化词列表。 (2011) | Maureen McCusker | 2011 | https://doi.org/10.1037/a0022530 |
Mindfulness | Two categories of mindfulness language describing the mindfulness state and the more encompassing “mindfulness journey” 描述正念状态的两类正念语言和更全面的“正念之旅” | Collins et al. | 2009 | https://doi.org/10.1037/a0017579 |
Mind_Perception | Measures linguistic use of mind perception (words related to “agency” and “experience”) in naturalistic settings 在自然主义环境中测量心理感知(与“agency”和“experience”相关的词)的语言使用 | Schweitzer & Waytz | 2020 | https://doi.org/10.1037/xge0001013 |
Moral_Foundations_v2.0 | An updated version of the Moral Foundations Dictionary that is recommended over the original by its creators. 道德词典的更新版本,由其创建者推荐。 | Jeremy Frimer | 2019 | https://doi.org/10.1016/j.jrp.2019.103906 |
Moral_Justification | Measures variation in justification content (deontological, consequentialist, or emotive) as a function of moral foundations 衡量辩护内容(道义论、后果论或情感论)随道德基础的变化 | Wheeler & Laham | 2016 | https://doi.org/10.1177/0146167216653374 |
Personal_Values_Dictionary | Measures the 10 Schwartz Values (and 4 higher-order value dimensions). 测量 10 个 Schwartz 值(和 4 个高阶值维度)。 | Ponizovskiy et al. | 2020 | https://doi.org/10.1002/per.2294 |
Prosocial_Words | Calculates the density of prosocial words in anything that a person says 计算一个人所说的任何内容中亲社会词的密度 | Jeremy Frimer | 2022-01-01 | https://doi.org/10.1073/pnas.1500355112 |
Regulatory_Mode | Locomotion and Assessment States of Goal Pursuit 目标追求的运动和评估状态 | Dana Kanze, Mark A. Conley, and E. Tory Higgins | 2019 | https://doi.org/10.1016/j.obhdp.2019.04.002 |
Security_Language | Provides a reference for the comparative study of security-related linguistic repertoires in political texts (speeches, policy documents, etc.). 为政治文本(演讲、政策文件等)中与安全相关的语言库的比较研究提供参考。 | Stephane Baele & Olivier Sterck | 2014 | https://doi.org/10.1111/1467-9248.12147 |
Self-Care | Measures the degree to which self-care words are used (e.g., diet, yoga) 衡量自我保健词的使用程度(例如,饮食、瑜伽) | Xunyi Wang et al. | 2018 | https://doi.org/10.1093/jamia/ocy012 |
Stereotype_Content | A stereotype content dictionary, made using a semi-automated method, to capture the Stereotype Content Model in text 使用半自动化方法制作的刻板印象内容字典,用于捕获文本中的刻板印象内容模型 | Nicolas et al. | 2022-01-01 | https://doi.org/10.1002/ejsp.2724 |
Stress | A dictionary used to measure psychological stress. Created based on the LIWC2007 English Dictionary. 用来测量心理压力的字典。 根据 LIWC2007 英语词典创建。 | Wei Wang et al. | 2022-01-01 | https://doi.org/10.1111/apps.12065 |
Well_Being | Words that might indicate the presence of purpose or meaning 可能表明存在目的或意义的词 | Ratner et al. | 2019 | https://doi.org/10.1080/10888691.2019.1659140 |
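Sharing dictionaries
Many sentiment dictionaries for specific research fields have been published for Chinese. If you have a dictionary to recommend, please email me at thunderhit@qq.com and I will convert it into cntext's built-in format.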