class: center, middle, inverse, title-slide

.title[
# Applications of Text Analysis in Economics and Management Research
]

.author[
### Deng Xudong (邓旭东)
]

.date[
### 2022-09-09
]

---
## About Me

PhD student in management at Harbin Institute of Technology (HIT), majoring in Management Information Systems.

Bachelor's and master's degrees both in Business Administration; research interest: the quantified self.

WeChat official account: 大邓和他的Python

[Python for Empirical Indicator Construction & Text Analysis](https://textdata.cn/blog/management_python_course/)

---

## Understanding Text

<center><img src="img/multitudes-of-content-illustration.jpeg" alt="multitudes-of-content-illustration" height="40%" width="40%"/></center>

--

<center><img src="img/consumer_org_society.png" alt="consumer_org_society" height="40%" width="40%"/></center>

---

## Understanding Text

In *Encoding and Decoding in the Television Discourse*, Stuart Hall proposed the encoding/decoding theory.

<center><img src="img/SenderReceiver.png" alt="SenderReceiver" height="60%" width="60%"/></center>

--

- How does text reflect its Sender?
- How does text impact its Receiver?

--

To do research with text, first pin down three things:

- Role: Sender or Receiver
- Direction: Reflect or Impact
- Content: the Sender's mind (cognition, preferences, ...) vs. the Receiver's mind (cognition, preferences, ...)

---

.left-column[

## Understanding Text

Berger, Jonah, Ashlee Humphreys, Stephan Ludwig, Wendy W. Moe, Oded Netzer, and David A. Schweidel. "Uniting the tribes: Using text for marketing insight." Journal of Marketing 84, no. 1 (2020): 1-25.

]

.right-column[

<center><img src="img/生产与消费.png" alt="production and consumption" height="70%" width="80%"/></center>

]

---

## Manual Coding and Machine Coding

<center><img src="img/unstructrueddata.png" alt="unstructured data" height="70%" width="70%"/></center>

--

| | Analysis method | Strengths | Weaknesses |
| :------------- | ------------------------------------------------- | ---------------------------------------------------------- | --------------------------------------------------------------------------- |
| Manual coding | Qualitative (grounded theory) | Small data, deep insight. | Hard to scale to big data;<br>coding standards are inconsistent. |
| Machine coding | Word frequency, vector similarity, vector distance | Consistent standards;<br>suited to large-scale text mining. | Has to break the structure of the text,<br>losing part of the information. |

---

## Machine Coding - Converting Text into Numbers or Vectors

- Symbolic methods (each word maps to a number)
  - Dictionary (word frequency) method
  - Bag of words, TF-IDF
- Word embeddings (each word maps to a vector)

---

## Symbolic Method

[LIWC (Linguistic Inquiry and Word Count)](https://textdata.cn/blog/liwc_python_text_mining/)

<center><img src="img/symbol-representation-1.png" alt="symbol representation" height="70%" width="80%"/></center>

---

## Applications of the Symbolic Method

| Concept | Measurement |
| ----------------------------------- | ------------------------------------------------------------ |
| **Conscientiousness (effort)** | Count the number of words in the text |
| **Sentiment** | Using a sentiment dictionary, compute the share of positive words in the text |
| **Readability** | Share of difficult (or specialized) words in the text |
| **Objectivity** | Variance of some value within the text, such as sentiment<br>- A ``Good product, damaged packaging, great service, overall I still recommend buying it!`` [5, 1, 5, 4]<br>- B ``Garbage product, garbage to use, damaged packaging, bad review!!`` [1, 1, 1, 1]<br>A has the larger variance, so it is more objective |
| **Similarity (policy stability)** | cosine(text_vector1, text_vector2) |
| ... | ... |

---

## Word Embeddings

Word embedding techniques include Word2Vec and GloVe.

In an n-dimensional space, the meaning of each word is represented as a vector.

<center><img src="img/embeddings-based.png" alt="embeddings-based" height="70%" width="80%"/></center>

---

## Word Embeddings and Cognition

<center><img src="img/Concept_Words_Project.png" alt="Concept_Words_Project" height="80%" width="100%"/></center>

Grand, G., Blank, I.A., Pereira, F. and Fedorenko, E., 2022. Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nature Human Behaviour, pp.1-13.

---

## Technique Comparison

| Approach | Method | Granularity analogy | Task | Example |
| ------------------------------------------ | --------------------------------- | ---------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| **Symbolic - dictionary** (word frequency) | Counting | Atom | Count the nouns in each sentence | sent_num1 = 2<br>sent_num2 = 1 |
| **Symbolic - bag of words** | bag of words<br>one-hot<br>TF-IDF | Molecule | Convert sentences to vectors and compute the similarity of two sentences. | vec1 = [1, 1, 1, 1, 1, 0]<br>vec2 = [0, 1, 0, 1, 0, 1]<br>similarity = cosine(vec1, vec2) |
| **Word embeddings** | word2vec,<br>GloVe, etc. | Neutron, proton, electron | Word similarity (e.g., similar magnitude but opposite direction in meaning; attitudes, biases) | mom = [0.2, 0.7, 0.1]<br>dad = [0.3, 0.5, -0.2] |

---

| Paper | Qualitative | Word frequency | Bag of words | W2V dictionary building | W2V cognitive change |
| ------------------------------------------------------------ | ---- | ---- | ---- | --------- | ----------- |
| 王伟, 陈伟, 祝效国 and 王洪伟, 2016. 众筹融资成功率与语言风格的说服性--基于 Kickstarter 的实证研究 [Crowdfunding success and the persuasiveness of language style: an empirical study based on Kickstarter]. *管理世界 (Management World)*, (5), pp.81-98. | Y | Y | | | |
| [How language concreteness affects customer satisfaction](https://textdata.cn/blog/jcr_concreteness_computation/)<br>Packard, Grant, and Jonah Berger. "How concrete language shapes customer satisfaction." *Journal of Consumer Research* 47, no. 5 (2021): 787-806. | | Y | | | |
| Wang, Quan, Beibei Li, and Param Vir Singh. "Copycats vs. original mobile apps: A machine learning copycat-detection method and empirical analysis." Information Systems Research 29, no. 2 (2018): 273-291. | | | Y | | |
| [Text similarity](https://textdata.cn/blog/2019-12-08-lazy-prices/)<br>Cohen, L., Malloy, C. and Nguyen, Q., 2020. Lazy prices. *The Journal of Finance*, *75*(3), pp.1371-1415. | | | Y | | |
| Kai Li, Feng Mai, Rui Shen, Xinyan Yan, [Measuring Corporate Culture Using Machine Learning](https://github.com/MS20190155/Measuring-Corporate-Culture-Using-Machine-Learning), The Review of Financial Studies, 2020 | | | Y | Y | |
| Hiring women into senior leadership changes gender bias within organizations<br>Lawson, M. Asher, Ashley E. Martin, Imrul Huda, and Sandra C. Matz. "Hiring women into senior leadership positions is associated with a reduction in gender stereotypes in organizational language." *Proceedings of the National Academy of Sciences* 119, no. 9 (2022): e2026443119. | | | | | Y |

---

.left-column[

## Case 1 - Crowdfunding Language Style

王伟, 陈伟, 祝效国 and 王洪伟, 2016. 众筹融资成功率与语言风格的说服性--基于 Kickstarter 的实证研究 [Crowdfunding success and the persuasiveness of language style: an empirical study based on Kickstarter]. *管理世界 (Management World)*, (5), pp.81-98.

Grounded theory to identify the styles; dictionary expanded through co-occurrence and synonym methods.

]

.right-column[

<center><img src="img/众筹-种子词.png" alt="crowdfunding seed words" height="30%" width="50%"/></center>

<center><img src="img/众筹-流程图.png" alt="crowdfunding flowchart" height="60%" width="80%"/></center>

]

---

.left-column[

## Case 2 - Copycat vs. Original

Wang, Quan, Beibei Li, and Param Vir Singh. "Copycats vs. original mobile apps: A machine learning copycat-detection method and empirical analysis." Information Systems Research 29, no. 2 (2018): 273-291.

Document vectorization; K-means clustering.

]

.right-column[

<center><img src="img/copycat.png" alt="copycat" height="90%" width="90%"/></center>

]

---

.left-column[

## Case 3 - Lazy Prices: Text Similarity

Cohen, L., Malloy, C. and Nguyen, Q., 2020. [Lazy prices](https://textdata.cn/blog/2019-12-08-lazy-prices/). *The Journal of Finance*, *75*(3), pp.1371-1415.

Document vectorization; cosine(doc_vec1, doc_vec2)

]

.right-column[

<center><img src="img/lazy-prices-1.png" alt="lazy-prices-1" height="40%" width="55%"/></center>

<center><img src="img/lazy-prices-2.png" alt="lazy-prices-2" height="47%" width="55%"/></center>

]

---

#### Case 4 - Hiring Women into Senior Leadership Changes Gender Stereotypes within Organizations (PNAS, 2022)

<center><img src="img/hiring_women.png" alt="hiring_women" height="95%" width="95%"/></center>

---

class: center, middle

# Thanks!

[**https://textdata.cn/**](https://textdata.cn/)

**WeChat official account: 大邓和他的Python**
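
---

## Appendix - Cosine Similarity and Variance Sketch

The deck uses `cosine(vec1, vec2)` for sentence and document similarity, and the variance of per-aspect ratings as a proxy for objectivity. The snippet below is a minimal, standard-library sketch of both. It is not code from any of the cited papers. The helper names `bag_of_words` and `cosine` and the two toy sentences are assumptions made for illustration; the vectors and ratings are copied from the slides. Population variance is one possible reading of "variance" here.

```python
import math
from collections import Counter
from statistics import pvariance


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in a tokenized text."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(vec1, vec2):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Vectors copied from the bag-of-words row of the comparison table.
vec1 = [1, 1, 1, 1, 1, 0]
vec2 = [0, 1, 0, 1, 0, 1]
similarity = cosine(vec1, vec2)

# Hypothetical toy sentences (not from the slides), already tokenized.
doc1 = ["the", "product", "is", "good"]
doc2 = ["the", "packaging", "is", "damaged"]
vocabulary = sorted(set(doc1) | set(doc2))
doc_similarity = cosine(bag_of_words(doc1, vocabulary),
                        bag_of_words(doc2, vocabulary))

# Per-aspect ratings copied from the objectivity example (reviews A and B).
variance_a = pvariance([5, 1, 5, 4])
variance_b = pvariance([1, 1, 1, 1])

print(similarity, doc_similarity, variance_a, variance_b)
```

Under these assumptions, `vec1` and `vec2` share two nonzero positions, so their similarity is 2/√15; review A has the larger rating variance and would be read as the more objective of the two. For real corpora, a library vectorizer with TF-IDF weighting would normally replace the toy `bag_of_words` helper.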