Applications of Text Analysis in Economics and Management Research

class: center, middle, inverse, title-slide

.title[
# Applications of Text Analysis in Economics and Management Research
]
.author[
### 邓旭东
]
.date[
### 2022-09-09
]

---

<div>
<style type="text/css">.xaringan-extra-logo {
width: 110px;
height: 128px;
z-index: 0;
background-image: url(img/哈工大.png);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:1em;right:1em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.title-slide):not(.inverse):not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('div')
          logo.classList = 'xaringan-extra-logo'
          logo.href = null
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

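## About Me

PhD candidate at the School of Management, Harbin Institute of Technology, majoring in Management Information Systems.

Bachelor's and master's degrees in Business Administration; research focus: the quantified self.

WeChat official account: 大邓和他的Python

[Python for Empirical Indicator Construction & Text Analysis](https://textdata.cn/blog/management_python_course/)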

---

## Understanding Text

---

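## Understanding Text

In "Encoding and Decoding in the Television Discourse", Stuart Hall proposed the encoding/decoding theory.

- How does text reflect its Sender?
- How does text impact its Receiver?

Before using text in research, clarify three things:

- Role: Sender or Receiver
- Direction: Reflect or Impact
- Content: the Sender's mind (cognition, preferences, ...) vs. the Receiver's mind (cognition, preferences, ...)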

---

.left-column[

## Understanding Text

Berger, Jonah, Ashlee Humphreys, Stephan Ludwig, Wendy W. Moe, Oded Netzer, and David A. Schweidel. "Uniting the tribes: Using text for marketing insight." Journal of Marketing 84, no. 1 (2020): 1-25.
]

.right-column[
<center><img src="img/生产与消费.png" alt="Production and consumption" height="70%" width="80%"/></center>
]

---

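## Manual Coding vs. Machine Coding

|                              | Analysis method                                        | Strengths                                                    | Weaknesses                                       |
| :---------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------ |
| Manual coding | Qualitative (grounded theory) | Deep insight from a small amount of data. | Hard to scale to big data;<br>coding standards are inconsistent. |
| Machine coding | Word frequency, vector similarity, vector distance | Consistent standard;<br>suited to large-scale text mining. | Has to break up the structure of the text,<br>which loses part of the information. |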

---

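## Machine Coding: Converting Text to Numbers or Vectors

- Symbolic approach (each word maps to a number)

- Dictionary (word-frequency) methods
   - Bag of words, TF-IDF

- Word embeddings (each word maps to a vector)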
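
The first two encodings can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the three sentences are made up, tokenization is a whitespace split, and the TF-IDF formula is one common variant (libraries such as scikit-learn apply extra smoothing and normalization).

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, tokenized by whitespace).
docs = [
    "the product is good and the service is good",
    "the packaging is damaged",
    "good product good price",
]
tokenized = [d.split() for d in docs]

# Symbolic encoding: every distinct word gets an integer id.
vocab = sorted({w for doc in tokenized for w in doc})
word_to_id = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens):
    """Return a count vector aligned with `vocab`."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """Term frequency times log inverse document frequency (one common variant)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)  # documents containing w
        vec.append(tf * math.log(n_docs / df))
    return vec

bow = [bag_of_words(t) for t in tokenized]
tfidf = [tf_idf(t) for t in tokenized]
```

Words that appear in fewer documents get a larger IDF weight, so TF-IDF down-weights words that are common across the whole corpus.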

---

## Symbolic Approach

[LIWC(Linguistic Inquiry and Word Count)](https://textdata.cn/blog/liwc_python_text_mining/)
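
A LIWC-style count can be sketched as follows. The category names and word lists here are invented for illustration; the real LIWC dictionary is a licensed product with its own categories and matching rules.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary; real LIWC categories and word lists differ.
CATEGORIES = {
    "posemo": {"good", "great", "happy", "love"},
    "negemo": {"bad", "broken", "sad", "hate"},
    "i": {"i", "me", "my"},
}

def category_shares(text):
    """Share of tokens in `text` that fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for tok in tokens:
        for name, words in CATEGORIES.items():
            if tok in words:
                hits[name] += 1
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {name: hits[name] / total for name in CATEGORIES}

shares = category_shares("I love my new phone, but the box was broken.")
```

Each share is the number of matched tokens divided by the total token count, which is the basic "word count" idea behind dictionary methods.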

---

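## Applications of the Symbolic Approach

| Concept | Measurement |
| ---------------------- | ------------------------------------------------------------ |
| **Conscientiousness (effort)** | Count the number of words in the text |
| **Sentiment** | Use a sentiment dictionary and compute the share of positive words in the text |
| **Readability** | Share of difficult (or technical) words in the text |
| **Objectivity** | Variance of some score within the text, e.g. sentiment<br>- A ``The product is good, the packaging was damaged, the service was great, overall I still recommend buying it!`` [5, 1, 5, 4]<br>- B ``Garbage product, garbage to use, damaged packaging, bad review!!`` [1,  1,  1,  1]<br>A has the larger variance, so it is treated as more objective |
| **Similarity (policy stability)** | cosine(text_vector1, text_vector2) |
| ...                    | ...                          |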
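
The objectivity row can be checked with a few lines of Python. The scores are the ones in the table; reading them as clause-level sentiment on a 1 to 5 scale is an assumption, as is the choice of population variance.

```python
from statistics import pvariance

# Clause-level sentiment scores from the table (assumed scale: 1 = very negative, 5 = very positive).
review_a = [5, 1, 5, 4]
review_b = [1, 1, 1, 1]

# Population variance of the scores inside each review.
objectivity_a = pvariance(review_a)  # 2.6875
objectivity_b = pvariance(review_b)  # 0 (no variation at all)
```

Review A mixes praise and criticism, so its scores vary; review B is uniformly negative, so its variance is zero.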

---

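## Word Embeddings

Word embedding techniques include Word2Vec and GloVe.

In an n-dimensional space, the meaning of a word is represented as a vector.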

---

## Word Embeddings and Cognition

Grand, G., Blank, I.A., Pereira, F. and Fedorenko, E., 2022. Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nature Human Behaviour, pp.1-13.
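
The idea of semantic projection can be sketched as projecting word vectors onto a line defined by two antonyms. The 3-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions), and this simplified version uses a single antonym pair; it is not the paper's exact procedure.

```python
import math

# Toy vectors invented for illustration only.
vectors = {
    "small": [-1.0, 0.2, 0.0],
    "large": [1.0, 0.1, 0.1],
    "mouse": [-0.8, 0.5, 0.3],
    "dog": [0.0, 0.6, 0.2],
    "elephant": [0.9, 0.4, 0.3],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Semantic axis for "size": the direction pointing from "small" to "large".
axis = [l - s for l, s in zip(vectors["large"], vectors["small"])]
axis_norm = math.sqrt(dot(axis, axis))

def project(word):
    """Scalar projection of a word vector onto the size axis."""
    return dot(vectors[word], axis) / axis_norm

# Rank animals from smallest to largest along the axis.
ranking = sorted(["dog", "elephant", "mouse"], key=project)
```

With these toy vectors the ranking comes out as mouse, dog, elephant, which mirrors how a feature such as size can be read off an embedding space.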

---

## Technique Comparison

| Approach           | Technique                         | Granularity analogy | Task                                             | Example                                                      |
| ------------------ | --------------------------------- | ---------------- | -------------------------------------------------- | ------------------------------------------------------------ |
| **Symbolic: dictionary** (word frequency) | Counting | Atom | Count the number of nouns in each sentence | sent_num1 = 2<br>sent_num2 = 1                               |
| **Symbolic: bag of words** | bag of words<br>one-hot<br>TF-IDF | Molecule | Convert sentences to vectors and compute the similarity of two sentences. | vec1 = [1, 1, 1, 1, 1, 0]<br>vec2 = [0, 1, 0, 1, 0, 1]<br>similarity = cosine(vec1, vec2) |
| **Word embeddings** | word2vec,<br>GloVe, etc. | Neutrons, protons, electrons | Word similarity (e.g. similar magnitude but opposite direction in meaning; attitudes, biases) | mom = [0.2, 0.7, 0.1]<br/>dad   = [0.3, 0.5, -0.2]           |
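
The `cosine` call in the table can be written out directly. The vectors are the ones from the table; the function name `cosine` is just a local helper here, not a reference to a particular library.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Bag-of-words sentence vectors from the table.
vec1 = [1, 1, 1, 1, 1, 0]
vec2 = [0, 1, 0, 1, 0, 1]
sentence_similarity = cosine(vec1, vec2)  # 2 / sqrt(15), about 0.516

# Toy word vectors from the table.
mom = [0.2, 0.7, 0.1]
dad = [0.3, 0.5, -0.2]
word_similarity = cosine(mom, dad)  # about 0.861
```

The same function works for both rows of the table; only the way the vectors are built differs.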

---

| Reference                                                    | Qualitative | Word frequency | Bag of words | W2V dictionary building | W2V cognitive change |
| ------------------------------------------------------------ | ---- | ---- | ---- | --------- | ----------- |
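| 王伟, 陈伟, 祝效国 and 王洪伟, 2016. 众筹融资成功率与语言风格的说服性--基于 Kickstarter 的实证研究 [Crowdfunding success rate and the persuasiveness of language style: an empirical study based on Kickstarter]. *管理世界*, (5), pp.81-98. | Y    | Y    |      |           |             |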
| [How language concreteness affects customer satisfaction](https://textdata.cn/blog/jcr_concreteness_computation/)<br>Packard, Grant, and Jonah Berger. "How concrete language shapes customer satisfaction." *Journal of Consumer Research* 47, no. 5 (2021): 787-806. |      | Y    |      |           |             |
| Wang, Quan, Beibei Li, and Param Vir Singh. "Copycats vs. original mobile apps: A machine learning copycat-detection method and empirical analysis." Information Systems Research 29, no. 2 (2018): 273-291. |      |      | Y    |           |             |
| [Text similarity](https://textdata.cn/blog/2019-12-08-lazy-prices/)<br>Cohen, L., Malloy, C. and Nguyen, Q., 2020. Lazy prices. *The Journal of Finance*, *75*(3), pp.1371-1415. |      |      | Y    |           |             |
| Kai Li, Feng Mai, Rui Shen, Xinyan Yan, [Measuring Corporate Culture Using Machine Learning](https://github.com/MS20190155/Measuring-Corporate-Culture-Using-Machine-Learning), The Review of Financial Studies, 2020 |      |      | Y    | Y         |             |
| Hiring women into senior leadership changes gender bias within organizations<br>Lawson, M. Asher, Ashley E. Martin, Imrul Huda, and Sandra C. Matz. "Hiring women into senior leadership positions is associated with a reduction in gender stereotypes in organizational language." *Proceedings of the National Academy of Sciences* 119, no. 9 (2022): e2026443119. |      |      |      |           | Y           |
---

.left-column[
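## Case 1: Language Style in Crowdfunding

王伟, 陈伟, 祝效国 and 王洪伟, 2016. 众筹融资成功率与语言风格的说服性--基于 Kickstarter 的实证研究 [Crowdfunding success rate and the persuasiveness of language style: an empirical study based on Kickstarter]. *管理世界*, (5), pp.81-98.

Grounded theory to identify styles; dictionary expanded via co-occurrence and synonyms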
]

.right-column[
<center><img src="img/众筹-种子词.png" alt="Crowdfunding seed words" height="30%" width="50%"/></center>

<center><img src="img/众筹-流程图.png" alt="Crowdfunding flowchart" height="60%" width="80%"/></center>
]
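
Expanding a seed dictionary by co-occurrence can be sketched as below. This is only an illustration of the general idea, assuming sentence-level co-occurrence; the sentences, seed words, stop-word list, and the `expand` helper are all hypothetical and not taken from the paper.

```python
from collections import Counter

# Hypothetical project-description sentences.
sentences = [
    "we guarantee quality and refund",
    "we promise quality",
    "refund guarantee with quality support",
    "new colors available",
]
seeds = {"guarantee", "promise"}      # hypothetical seed words for one style
stopwords = {"we", "and", "with"}     # hypothetical stop-word list

def expand(seeds, sentences, top_k):
    """Return the top_k words that most often share a sentence with a seed word."""
    co_counts = Counter()
    for sent in sentences:
        tokens = set(sent.split())
        if tokens & seeds:
            co_counts.update(tokens - seeds - stopwords)
    return [word for word, _ in co_counts.most_common(top_k)]

candidates = expand(seeds, sentences, top_k=2)
```

In practice the candidate words would still be reviewed by hand before being added to the dictionary.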

---

.left-column[
## Case 2: Copycat vs. Original Apps

Wang, Quan, Beibei Li, and Param Vir Singh. "Copycats vs. original mobile apps: A machine learning copycat-detection method and empirical analysis." Information Systems Research 29, no. 2 (2018): 273-291.

Document vectorization; K-means clustering

]

.right-column[
<center><img src="img/copycat.png" alt="copycat" height="90%" width="90%"/></center>

]
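
Document vectorization followed by K-means can be sketched in plain Python. The app descriptions are made up, the vectors are raw word counts, and the centroids are initialized from fixed documents so the result is reproducible; the paper's actual features and pipeline are richer than this.

```python
from collections import Counter

# Hypothetical app descriptions.
descriptions = [
    "match candy puzzle game",
    "candy match puzzle fun game",
    "photo filter camera editor",
    "camera photo editor filter effects",
]
vocab = sorted({w for d in descriptions for w in d.split()})

def vectorize(text):
    """Bag-of-words count vector aligned with `vocab`."""
    counts = Counter(text.split())
    return [float(counts[w]) for w in vocab]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centroids, n_iter=10):
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

X = [vectorize(d) for d in descriptions]
# Deterministic initialization from the first and third documents.
labels = kmeans(X, init_centroids=[X[0], X[2]])
```

Apps that land in the same cluster have similar descriptions and become candidates for a closer original-versus-copycat comparison.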

---

.left-column[
## Case 3: Lazy Prices and Text Similarity

Cohen, L., Malloy, C. and Nguyen, Q., 2020. [Lazy prices](https://textdata.cn/blog/2019-12-08-lazy-prices/). *The Journal of Finance*, *75*(3), pp.1371-1415.

Document vectorization;

cosine(doc_vec1, doc_vec2)
]

.right-column[
<center><img src="img/lazy-prices-1.png" alt="lazy-prices-1" height="40%" width="55%"/></center>

<center><img src="img/lazy-prices-2.png" alt="lazy-prices" height="47%" width="55%"/></center>
]
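
Year-over-year similarity of a firm's reports can be sketched as follows. The filing snippets are invented, and the vectors are simple word counts; this only illustrates `cosine(doc_vec1, doc_vec2)` applied to consecutive years.

```python
import math
from collections import Counter

# Hypothetical snippets standing in for one firm's annual reports.
filings = {
    2018: "we face competition risk and currency risk",
    2019: "we face competition risk and currency risk",
    2020: "we face litigation risk and new regulatory investigation",
}

def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

years = sorted(filings)
# Similarity of each report to the previous year's report.
yoy = {year: cosine(filings[prev], filings[year]) for prev, year in zip(years, years[1:])}
```

An unchanged report scores 1, and a drop in the score flags a year in which the wording changed.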

---

#### Case 4: Hiring Women into Senior Leadership Changes Gender Stereotypes Within Organizations (PNAS 2022)

---
class: center, middle

# Thanks!

[**https://textdata.cn/**](https://textdata.cn/)

**WeChat official account: 大邓和他的Python**