Blogs

PNAS | Linguistic positivity in historical texts reflects dynamic environmental and psychological factors (with Python code)

Linguistic positivity in historical texts reflects dynamic environmental and psychological factors...

Data analysis | Analyzing Xiaohongshu post data with decision trees (with code)

Using decision trees to analyze what characterizes popular Xiaohongshu posts and how a post becomes popular...

Psychological research based on word embedding technology: Methods and applications

As a fundamental technique in natural language processing (NLP), word embedding represents each word as a low-dimensional, dense, continuous numeric vector (i.e., a word vector). Word embeddings are obtained by training machine learning algorithms such as neural networks on large-scale text corpora, either to predict the surrounding words given a word or vice versa (Word2Vec and FastText), or to predict the co-occurrence probability of multiple words (GloVe). In theory, the dimensions of a word vector reflect how the word can be predicted from its contexts; in practice, they also carry substantial semantic information about the word. Word embeddings can therefore be used to analyze the semantic meaning of text. In recent years, word vectors and the semantic association measures derived from them have been increasingly applied to study human psychology, including semantic processing, cognitive judgment, divergent thinking, social biases and stereotypes, and sociocultural change at the societal or population level. Future research using word embeddings should (1) distinguish between implicit and explicit components of social cognition, (2) train fine-grained word vectors by time and region to facilitate cross-temporal and cross-cultural research, and (3) apply contextualized word embeddings and large pre-trained language models such as GPT and BERT. To enhance the application of word embeddings in psychology...
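As a rough illustration of the "semantic association measures" mentioned in the abstract above, the sketch below computes cosine similarity between word vectors. The three-dimensional vectors and the words chosen are made up for the example; real embeddings such as Word2Vec or GloVe are learned from large corpora and usually have a few hundred dimensions.

```python
import math

# Toy vectors invented for illustration only (an assumption, not real embeddings).
toy_vectors = {
    "doctor": [0.9, 0.1, 0.3],
    "nurse": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# A higher cosine similarity is read as a closer semantic association.
sim_close = cosine_similarity(toy_vectors["doctor"], toy_vectors["nurse"])
sim_far = cosine_similarity(toy_vectors["doctor"], toy_vectors["banana"])
print(f"doctor~nurse: {sim_close:.3f}, doctor~banana: {sim_far:.3f}")
```

With real pretrained vectors, the same function could be applied to any pair of words; the toy dictionary is only a stand-in for a loaded embedding model.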

EDGAR | Pretrained word embedding model built on 25 years of data

EDGAR is the Electronic Data Gathering, Analysis, and Retrieval system of the US Securities and Exchange Commission (SEC). It lets the public access, over the internet, the documents that companies file with the SEC, such as registration statements, annual reports, and other disclosure documents. These filings contain financial information, business information, and other key information about the companies, which makes them very useful for investors and researchers. Students in finance and related fields who want to conduct research using **word embedding** techniques may consider using this open-source dataset...

Dataset | Collected scripts of 睡前消息 (Bedtime News) from 马前卒工作室

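I have long been in the habit of watching 睡前消息 from 马前卒工作室. I find its content very rational, with a Marxist-Leninist, scientific-socialist flavor. Two of its topics went viral across the internet: the Dushan County debt problem and the Yiling Pharmaceutical Lianhua Qingwen capsule affair. **The data can be used to practice word-frequency counting, word cloud generation, sentiment analysis, and LDA topic modeling. It has been compiled into a CSV file for anyone who needs it**...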
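For readers who want to try the word-frequency exercise, here is a minimal sketch using only the Python standard library. The post does not give the file name or column names, so the in-memory CSV and its `title` and `text` columns are hypothetical stand-ins; the toy text is split on whitespace, whereas the actual Chinese scripts would need a word segmenter first.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the CSV file; the column names "title" and "text"
# are assumptions, not taken from the real dataset.
sample_csv = io.StringIO(
    "title,text\n"
    "episode 1,debt debt county budget\n"
    "episode 2,capsule company debt\n"
)


def word_frequencies(rows, column="text"):
    """Count whitespace-separated tokens in one column across all rows."""
    counts = Counter()
    for row in rows:
        # Whitespace split works for this toy data only; Chinese text has no
        # spaces between words and would need segmentation before counting.
        counts.update(row[column].split())
    return counts


freqs = word_frequencies(csv.DictReader(sample_csv))
print(freqs.most_common(3))
```

The resulting `Counter` can also serve as input for a word cloud, since most word cloud tools accept a word-to-frequency mapping.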